{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Generative adversarial nets , Authors: Ian J. Goodfellow Jean Pouget-Abadie... , 2014\n2.  Title: Explaining and harnessing adversarial examples , Authors: Ian J. Goodfellow Jonathon Shlens... , 2014\n3.  Title: Do deep generative models know what they don\u2019t know , Authors: Eric Nalisnick Akihiro Matsukawa... , 2019\n4.  Title: Detecting out-of-distribution inputs to deep generative models using a test for typicality , Authors: Eric Nalisnick Akihiro Matsukawa... , 2019\n5.  Title: Generative probabilistic novelty detection with adversarial autoencoders , Authors: Stanislav Pidhorskyi Ranya Almohsen... , 2018\n6.  Title: Deep one-class classification , Authors: Lukas Ruff Robert Vandermeulen... , 2018\n7.  Title: Deep semi-supervised anomaly detection , Authors: Lukas Ruff Robert Vandermeulen... , 1906\n8.  Title: Adversarially learned one-class classifier for novelty detection , Authors: Mohammad Sabokrou Mohammad Khalooei... , 2018\n9.  Title: Efficient GAN-based anomaly detection , Authors: Houssam Zenati Chuan Sheng Foo... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Conventional out-of-distribution (OOD) detection schemes based on variational autoencoder or Random Network Distillation (RND) have been observed to assign lower uncertainty to the OOD than the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to the blurred images. Based on the observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. 
Further results show that SVD-RND learns better target distribution representation than the baseline RND algorithm. Finally, SVD-RND combined with geometric transform achieves near-perfect detection accuracy on the CelebA dataset.", "The goal of OOD detection is to determine whether the data is sampled from the target distribution D. Therefore, based on the training data Dtrain \u2282 D, we train a scalar function that expresses the confidence, or uncertainty, of the data. The performance of the OOD detector is tested on Dtest \u2282 D against the OOD dataset DOOD. We denote an in-distribution data and OOD pair as In : Out in this paper, e.g., CIFAR-10 : SVHN. In this section, we mention only work closely related to our research. For a broader survey on deep OOD detection, we recommend the paper from Chalapathy et al. (2019). OOD Detection: The majority of OOD detection methods rely on reconstruction error and representation learning. Ruff et al. (2018) trained a deep neural network to map data into a minimum volume hypersphere. Generative probabilistic novelty detection (GPND) (Pidhorskyi et al., 2018) employed the distance to the latent data manifold as the confidence measure and trained the adversarial autoencoder (AAE) to model the manifold. Deep generative models are widely employed for latent space modeling in OOD detection (Zenati et al., 2018; Sabokrou et al., 2018). However, a recent paper by Nalisnick et al. (2019) discovered that popular deep generative models, such as variational autoencoder (VAE) (Kingma et al., 2014) or GLOW (Kingma et al., 2018), fail to detect simple OOD. While adversarially trained generative models, such as generative adversarial networks (GAN) (Goodfellow et al., 2014) or AAE, are not discussed in Nalisnick et al. (2019), our experiments with GPND show that such models can also struggle to detect such simple OODs. 
OOD Detection with Additional Data: Some methods try to solve OOD detection by appending additional data or labels for training.", "Summary: They tackle the problem of out-of-data distribution by leveraging RND applied to data augmentations. They train a model f(x) to match the outputs of g_i(aug_i(x)), where g_i is a random network and aug_i is a particular type of augmentation. An example with high error in this task is treated as an out-of-distribution example. This work focuses on exploring blurring through SVD, where the smallest K singular values are set to 0, and K varies between different aug_i calls. They find that their method of consistently can achieve strong detection rates across multiple target-dataset pairs.\n\nComments:\n* The experimental results in this work are impressive, which introduces many more questions.\n* The model used for f and g is not mentioned in the text.\n* Figure 4 (left) suggests that the SVD-RND performs about the same between 10K and 50K examples. The level of robustness is surprising, but doesn\u2019t seem to square with intuition that more data ought to help. How little data can be used? In other words, extend the graph to the left. \n* The geometric transforms baseline is not fair, since SVD-RND uses multiple SVD transforms (b_train > 1) whereas the geometric transforms only have one. Please run a model with all the geometric transforms. This result is important for understanding whether the gains come from the particular transform (SVD) or the number of transforms used.\n* Following the spirit of the previous comment, what other data augmentations can be used in place of SVD? Typical image classification pipelines use a large variety of augmentations. I would suggest taking some augmentations from AutoAugment [1] and running RND on top of them.\n* An experiment that is missing is RND trained on blurred images. 
Is the blurring itself the important component, or is having multiple different heads important?\n* In general, I am confused about how a single head RND does not converge to 0 loss by learning the weights of g. This seems to be a simple optimization problem. The original RND paper avoided this problem by also using the network to learn a policy, but this does not exist in this approach.\n\nIn general, this work seems promising, but lacks proper ablations that elucidate what components of the method are important. I am happy to increase my score if the experiments suggests are added to the work.\n\n6: Weak Accept"], "input": "1 INTRODUCTION\nOut-of distribution (OOD) or novelty detection aims to distinguish samples in unseen distribution from the training distribution. A majority of novelty detection methods focus on noise filtering or representation learning. For example, we train an autoencoder to learn a mapping from the data to the bottleneck layer and use the bottleneck representation or reconstruction error to detect the OOD (Sakruada et al., 2014; Pidhorskyi et al., 2018). Recently, deep generative models (Kingma et al., 2014; Dinh et al., 2017; Goodfellow et al., 2014; Kingma et al., 2018; Schlegl et al., 2017) are widely used for novelty detection due to their ability to model high dimensional data. However, OOD detection performance of deep generative models has been called into question since they have been observed to assign a higher likelihood to the OOD data than the training data (Nalisnick et al., 2019; Choi et al., 2018).\nOn the other hand, adversarial examples are widely employed to fool the classifier, and training classifiers against adversarial attacks has shown effectiveness in detecting unknown adversarial attacks (Tramer et al., 2018). In this work, we propose blurred data as adversarial examples. 
When we tested novelty detection schemes on the blurred data generated by Singular Value Decomposition (SVD), we found that the novelty detection schemes assign higher confidence to the blurred data than the original data.\nMotivated by this observation, we employ blurring to prevent the OOD detector from overfitting to low resolution. We propose a new OOD detection model, SVD-RND, which is trained using the idea of Random Network Distillation (RND) (Burda et al., 2019) to discriminate the training data from their blurred versions. SVD-RND is evaluated in some challenging scenarios where vanilla generative models show nearly 50% detection accuracy, such as CIFAR-10 vs. SVHN and ImageNet vs. CIFAR-10 (Nalisnick et al., 2019). Compared to conventional baselines, SVD-RND shows a significant performance gain from 50% to over 90% in these domains. Moreover, SVD-RND shows improvements over baselines on domains where conventional OOD detection schemes show moderate results, such as CIFAR-10 vs. LSUN. 3 METHODOLOGY\nIn this section, we motivate our use of blurred data as adversarial examples to conventional deep OOD detection methods. Motivated by the observation, we present our proposed algorithm, SVD-RND, and provide intuition for why SVD-RND helps OOD detection. 3.1 GENERATING BLURRED DATA\nIn this work, blurred images function as adversarial examples. We directly employ the SVD on the data matrix of each image in the training data and force the bottom non-zero singular values to zero to construct a blurred image. Suppose that data image d \u2208 D consists of multiple channels, where the j-th channel has N_j nonzero singular values \u03c3_{j1} \u2265 \u03c3_{j2} \u2265 . . . \u2265 \u03c3_{jN_j} > 0. Then, the j-th channel can be represented as the weighted sum of orthonormal vectors.\nd_j = \u03a3_{t=1}^{N_j} \u03c3_{jt} u_{jt} v_{jt}^T (1)\nWe discard the bottom K non-zero singular values of each channel to construct the blurred image. 
We test conventional novelty detection methods on blurred images. We first train the VQ-VAE (Oord et al., 2017) on the CIFAR-10 (Krizhevsky et al., 2009) dataset. Figure 1 shows the loss of VQ-VAE on the test data and blurred test data (K = 28). We follow the settings of the original paper. VQ-VAE assigns higher likelihood to the blurred data than the original data.\nWe note that this phenomenon is not limited to generative models. We trained the RND on the CIFAR-10 dataset and plot the L2 loss on the test data and blurred test data in Figure 1. We refer to Appendix B for a detailed explanation and the architecture employed for the RND in the experiment. Furthermore, we plot the average loss on the blurred test data and original test data during the training procedure. Throughout the training phase, the model assigns lower uncertainty to the blurred data. This trend is similar to the CIFAR-10 : SVHN case observed by Nalisnick et al. (2019), where the generative model assigns more confidence to the OOD data throughout the training process.\nWhile we employ SVD for our main blurring technique, conventional techniques in image processing can be applied for blurring, such as Discrete Cosine Transform (DCT) or Gaussian Blurring. However, the hyperparameter search space of DCT is quadratic in size and is therefore much harder to optimize over than that of SVD. We further compare the performance between SVD and other blurring techniques in Section 4. 3.2 OOD DETECTION VIA SVD-RND\nWe now present our proposed algorithm, SVD-RND. SVD-RND trains the predictor network f to discriminate between the original and blurred datasets. We first generate blurred datasets Di from Dtrain by zeroing the bottom Ki non-zero singular values of each data channel (i = 1, . . . , btrain, where btrain is the number of generated blurred datasets used for training). We then construct btrain + 1 target networks, i.e., g0, g1, . . . , gbtrain. 
The target networks are independently randomly initialized and their parameters are unchanged during training. Predictor network f is trained to minimize the l2 loss against the corresponding target network on each dataset.\nf* = arg min_f [ \u03a3_{x \u2208 D_train} ||f(x) - g_0(x)||_2^2 + \u03a3_{i=1}^{b_train} \u03a3_{x \u2208 D_i} ||f(x) - g_i(x)||_2^2 ] (2)\nSVD-RND optimizes the predictor network f as shown in (2). When a new test sample x is given, SVD-RND outputs ||f(x) - g_0(x)||_2^2 as the uncertainty of the sample. Figure 2 shows the training process of SVD-RND. While the original RND paper employs a single target network to train the predictor network, SVD-RND employs multiple target networks to discriminate the original data from the blurred images.\nWhile SVD-RND directly regularizes only on the blurred images, OODs can be generated in alternative directions. For completeness, we investigate the performance of conventional models on OODs generated in directions orthogonal to blurring. We refer to Appendix D for the detailed experiment. 4 EXPERIMENTAL RESULTS\n 4.1 EXPERIMENT SETTING\nSVD-RND is examined in the cases in Table 1. CIFAR-10 : SVHN, CelebA (Liu et al., 2015) : SVHN (Netzer et al., 2011), and TinyImageNet (Deng et al., 2009) : (SVHN, CIFAR-10, CIFAR-100) are the cases studied by Nalisnick et al. (2019). We also studied CIFAR-10 : (LSUN (Yu et al., 2015), TinyImageNet), LSUN : (SVHN, CIFAR-10, CIFAR-100) and CelebA: (CIFAR-10, CIFAR-100) pairs. We implement the baselines and SVD-RND in the PyTorch framework.1 For a unified treatment, we resized all images in all datasets to 32\u00d732. We refer to Appendix C for the detailed setting. For SVD-RND, we optimize the number of discarded singular values over different datasets. We choose the detector with the best performance across the validation data. We refer to Appendix C for the parameter setting. We also examine the case where each image is blurred by DCT and Gaussian blurring. 
For DCT, we apply the DCT to the image, discard low magnitude signals, and generate the blurred image by inverse DCT. In DCT-RND, we optimize the number of discarded components in the frequency domain. For Gaussian blurring, we optimize the shape of the Gaussian kernel. We denote this method as GB-RND.\nWe compare the performance of SVD-RND, DCT-RND, and GB-RND to the following baselines.\nGenerative Probabilistic Novelty Detector: GPND (Pidhorskyi et al., 2018) is the conventional generative-model-based novelty detection method that models uncertainty as a deviation from the latent representation, which is modeled by the adversarial autoencoder. We trained GPND with further parameter optimization.\nGeometric Transforms: Golan et al. (2018) trained a classifier to discriminate in-distribution data against geometrically transformed data to achieve better OOD detection performance. The authors used four types of geometric transforms: flip, rotation, vertical translation, and horizontal translation. We test each transformation by setting them as OOD proxies in the RND framework. Moreover, we also investigate the effect of pixel inversion, contrast reduction, and shearing. We refer to Cubuk et al. (2019) for detailed explanation of the augmentation strategies.\nRND: We test the original RND (Burda et al., 2019) baseline.\n1Our code is based on https://github.com/kuangliu/pytorch-cifar\nTypicality Test: Nalisnick et al. (2019) set the OOD metric of the generative model as the distance between the mean log likelihood of the model on the training data and the log likelihood of the model on the test data. 
We apply the typicality test to the RND framework by employing the test loss of RND instead of log likelihood in the generative models.\nFive metrics on binary hypothesis testing are used to evaluate the OOD detectors: area under the Receiver Operating Characteristic curve (AUROC), area of the region under the Precision-Recall curve (AUPR), detection accuracy, and TNR (True negative rate) at 95% TPR (True positive rate). All criteria are bounded between 0 and 1, and the results close to 1 imply better OOD detection. 4.2 OOD DETECTION RESULTS\nWe summarize our results on the TNR at 95% TPR in Table 2. For example, a TNR of 96.9% is achieved by SVD-RND for the CIFAR-10 : SVHN pair. We refer to Appendix A for the full results. In all In : Out domains except the CelebA : (CIFAR-10, CIFAR-100) domain, SVD-RND outperforms all other baselines in every metric. Furthermore, all the proposed techniques outperform GPND and RND on all In : Out domains. We further visualize the CIFAR-10 data before and after blurring in Appendix E. We plot the performance of SVD-RND over different K1 in Figure 3. In Figure 3, we experimented with 4 seeds. At the best performing parameter for each OOD data, SVD-RND shows stable performance. See Appendix F for results under small K1.\nFurthermore, we plot the output of SVD-RND on target CIFAR-10 data and OOD SVHN data when K1 = 28 in Figure 3. Compared to the results in Figure 1, SVD-RND better discriminates SVHN data from the in-distribution data.\nGPND and RND fail to discriminate OOD from the targets in CIFAR-10 : SVHN, LSUN : (SVHN, CIFAR-10, CIFAR-100), TinyImageNet : (SVHN, CIFAR-10, CIFAR-100), and CelebA : SVHN domains. Moreover, GPND performs the SVD of the Jacobian matrix at test time, which makes GPND slower than SVD-RND. Furthermore, we visualize the uncertainty prediction of RND and SVD-RND in Figure 4, which shows the top-9 examples on CIFAR-10 test data, where SVD-RND and RND assign the highest uncertainty. 
We observe that SVD-RND tends to assign higher uncertainty to blurry or hardly recognizable images compared to RND.\nOn the other hand, OOD detection schemes based on geometric transformations (Golan et al., 2018) show generally improved results on detecting OOD data compared to RND and GPND. Especially in CelebA : (SVHN, CIFAR-10, CIFAR-100) domains, rotation-based methods and translation-based methods show excellent performance. However, in the CIFAR-10 target domain, OOD detection schemes based on geometric transformations show degraded performance against RND or GPND on LSUN and TinyImageNet OOD data.\nFurthermore, the typicality test shows mixed results compared to the baseline algorithms.\nFinally, we also investigate the case where limited training data is available. We examined the performance of SVD-RND and RND in CIFAR-10 : (LSUN, TinyImageNet) domains. Figure 4 shows the TNR at 95% TPR metric of each method when the number of training examples is reduced. For each OOD data, we denote the result on SVD-RND as OOD SVD, and denote the result on RND as OOD RND. 5 FURTHER ANALYSIS\nIn this section, we examine some other aspects of SVD-RND. In Section 5.1, we examine whether SVD-RND learns better representation compared to the baseline. Furthermore, we propose a novel heuristic for training SVD-RND in Section 5.2, where no validation OOD data is available. Finally, we show that we can further improve the performance of SVD-RND by incorporating geometric transformations in Section 5.3. 5.1 REPRESENTATION LEARNING IN SVD-RND\nWhile SVD-RND outperforms RND on every In : Out domain in Section 4, we provide further evidence that SVD-RND learns superior target distribution representation compared to RND. For the evidence, we fine-tune a classifier over the fixed activation map of SVD-RND and RND. We set the activation map as the output of the first 15 or 27 layers of the RND and SVD-RND predictor networks trained on the CIFAR-10 dataset. 
For the fine-tuning, we either appended three residual blocks and a linear output layer with softmax activation (denoted as 7-layer in Table 3) or a linear layer (denoted as linear in Table 3). Then, we fine-tune the appended network for the CIFAR-10 classification task. The SGD optimizer with learning rate 0.1 is used for fine-tuning, and the learning rate is annealed from 0.1 to 0.01 and 0.001 after 30 and 60 epochs over 100 epochs of training, respectively. We average the result across three fixed random seeds.\nWe show our results in Table 3. SVD-RND consistently outperforms RND on the fine-tuning task. Therefore, the result supports that SVD-RND learns better target distribution-specific knowledge. 5.2 SVD-RND WITHOUT OOD VALIDATION DATA\nIn our main experiments in Section 4, we used the OOD validation data for optimizing each Ki. However, in realistic scenarios, OOD data are generally unknown to the detector. We propose an effective rank (Roy et al., 2007) based design of SVD-RND that does not use the OOD validation dataset and compare its performance against the results in Section 4. Log effective rank of the single image matrix D is defined as the entropy of the normalized singular values (\u03c3_1, . . . , \u03c3_N) of the image matrix.\nLER_D = H( \u03c3_1 / \u03a3_{j=1}^{N} \u03c3_j , . . . , \u03c3_N / \u03a3_{j=1}^{N} \u03c3_j ) (3)\nIn (3), H is the entropy function. Then, the effective rank is defined as two raised to the power of the log effective rank. We set the effective rank of image data as the averaged effective rank of each channel.\nBased on (3), we propose selecting each Ki such that the average of log effective rank on each blurred dataset is equally spaced. Specifically, suppose the log effective rank of the data averaged in training dataset Dtrain is LER_{D_train}. Then, we set the target log effective rank LER_1, LER_2, . . . 
, LER_{b_train} as follows.\nLER_i = ( 0.5 + 0.5 \u00d7 (i - 1) / b_train ) LER_{D_train} (4)\nThen, we select Ki such that the average of the log effective rank in the blurred dataset with Ki discarded singular values is closest to LER_i. We test our criterion in CIFAR-10 and TinyImageNet data with different btrain. We train SVD-RND for 25 epochs for btrain = 3, and 20 epochs for btrain = 4. We show the performance of SVD-RND based on (4) in Table 4, which is denoted as SVD-RND (uniform). We also show results of SVD-RND optimized with the validation OOD data from Table 2 and denote them as SVD-RND (optimized) in Table 4. Uniform SVD-RND already outperforms the second-best methods in Table 2. Furthermore, as btrain increases, uniform SVD-RND approaches the performance of the optimized SVD-RND. 5.3 FURTHER IMPROVEMENT OF SVD-RND\nWhile SVD-RND achieves reasonable OOD detection performance, combining SVD-RND with other baseline algorithms may further enhance the performance. For example, as shown in Table 2, training against rotated data benefits OOD detection in CelebA dataset. Therefore, we combine SVD-RND and geometric transform-based methods to further improve SVD-RND. We treat both blurred data and geometrically transformed data as OOD and train the predictor network to discriminate the original data from the OOD. We combine rotation and vertical translation with SVD-RND and denote them as SVD-ROT-RND and SVD-VER-RND, respectively.\nWe compare the performance of SVD-ROT-RND and SVD-VER-RND against rotation and vertical translation in CelebA : (SVHN, CIFAR-10, CIFAR-100) domains. We refer readers to the results in Table 5. We observe that SVD-ROT-RND and SVD-VER-RND outperform their counterparts and SVD-RND. Especially, SVD-ROT-RND and SVD-VER-RND show significant performance gains in CelebA : (CIFAR-10, CIFAR-100) domains. 6 CONCLUSION\nIn this work, blurred images are introduced as adversarial examples in deep OOD detection. 
SVD-RND is employed for adversarial defense against blurred images. SVD-RND achieves significant performance gain in all In : Out domains. Even without the validation OOD data, we can design SVD-RND to outperform conventional OOD detection models. 7 ACKNOWLEDGEMENT\nThis work was supported by Samsung Electronics and the ICT R&D program of MSIP/IITP. [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion] A FULL OOD DETECTION RESULTS\n B RANDOM NETWORK DISTILLATION\nWe use RND (Burda et al., 2019) as the base model of our OOD detector. RND consists of a trainable predictor network f, and a randomly initialized target network g. The predictor network is trained to minimize the l2 distance against the target network on training data. The target network g remains fixed throughout the training phase.\nf* = arg min_f \u03a3_{x \u2208 D_train} ||f(x) - g(x)||_2^2 (5)\nThen, for unseen test data x, RND outputs ||f(x) - g(x)||_2^2 as an uncertainty measure. The main intuition of the RND is to reduce the distance between f and g only on the in-distribution, hence naturally distinguish between the in-distribution and the OOD.\nWe employ RND for our base OOD detector due to its simplicity over generative models. Also, RND has already been shown to be effective in novelty detection on the MNIST dataset (Burda et al., 2019).\nIn RND (Burda et al., 2019), the predictor network f has two more layers than the target network g, where g consists of 3 convolution layers and a fully connected layer. In our experiments, we set g as the first 33 layers of ResNet34 without ReLU activation in the end. f is constructed by appending two sequential residual blocks. The output size of each residual block is 1024 and 512. We also discard ReLU activation in the second residual block to match the output layer of g. 
C DATA PREPROCESSING, NETWORK SETTINGS, PARAMETER SETTINGS FOR MAIN EXPERIMENT\nTo make the OOD detection task harder, we reduce each of the training datasets of CelebA, TinyImageNet, and LSUN to 50000 examples and the test dataset of CelebA to 26032 examples. For TinyImageNet data, we discard half of the images in each class, resulting in 250 training examples for each of the 200 classes. Reduction in LSUN dataset results in 5000 examples for each of the 10 classes. Also, the first 1000 images of the test OOD data are used for validation. For SVD-RND and all other RND-based detectors, we use the same structure for f and g defined in Appendix B. The number of parameter updates is fixed across the experiments. The Adam optimizer, with a learning rate of 10^-4, is used for RND-based OOD detection methods. The learning rate is annealed from 10^-4 to 10^-5 halfway through the training process. For our main experiment, we average the result across two fixed random seeds.\nIn SVD-RND, DCT-RND, and GB-RND, we used btrain = 1 for the CIFAR-10 and CelebA datasets, and btrain = 2 for the TinyImageNet and LSUN datasets. For SVD-RND, we optimize across K1 \u2208 {18, 20, 22, 24, 25, 26, 27, 28} in the CIFAR-10 and CelebA datasets. For TinyImageNet and LSUN datasets, we optimize over K1 \u2208 {8, 10, 12, 14} and K2 \u2208 {22, 24, 26, 28}. In DCT-RND, we define Ki as the number of unpruned signals in the frequency domain. For CIFAR-10 and CelebA datasets, we optimize K1 across {4, 8, 12, 14, 16, 20, 24, 28}. For TinyImageNet and LSUN datasets, we optimize over K1 \u2208 {20, 24, 28, 32} and K2 \u2208 {40, 44, 48, 52}. For Gaussian blurring, we optimize over the shape (xi, yi) of the Gaussian kernel. We optimized the parameter over xi \u2208 {1, 3, 5}, yi \u2208 {1, 3, 5} for each blurred data. To fix the number of updates, we train SVD-RND, DCT-RND, and GB-RND for 50 epochs in the CIFAR-10 and CelebA datasets, and for 34 epochs for the rest.\nFor GPND, the settings of the original paper are followed. 
Furthermore, we optimize the reconstruction loss \u03bb1 and adversarial loss \u03bb2 for discriminator Dz across \u03bb1 \u2208 {8, 9, 10, 11, 12} and \u03bb2 \u2208 {1, 2, 3}. We choose the parameters with the best validation performance at 100 epochs. For RND, we trained over 100 epochs.\nFor geometric transforms, we optimize the magnitude of the shift of shear, horizontal translation and vertical translation methods. We optimize the magnitude of translation across {4, 8, 12, 16} and choose the parameter with the best validation performance. The detector is trained for \u2308100 / (|T| + 1)\u2309 epochs, where |T| is the number of transformations. The number of transformations is 1 in flipping and invert, 2 for horizontal translation, vertical translation, and shear, and 3 for rotation and contrast.\nFinally, for typicality test, we estimated the average test loss of the RND for 50000 training examples. For each test sample, we use the distance between the test loss of the sample and the estimated average loss as the OOD metric. D GENERATING OOD BY ADDING ORTHOGONAL VECTORS\nWe present the performance of RND on OODs generated by adding vectors orthogonal to the data. To generate such OODs, we sample a Gaussian vector z and compute the component z_{orth,x} of the random vector that is orthogonal to the data x.\nz_{orth,x} = z - (z^T x / x^T x) x (6)\nWe scaled the l2 norm of the orthogonal vector z_{orth,x} on each data to be \u03b1% of the l2 norm of the signal. We report the average uncertainty of RND on the original data, blurred data, and the perturbed data in Table 7. From the 20 independent runs on the perturbed data, we report the case with smallest test uncertainty in Table 7. We varied \u03b1 from 5 to 20. 
While blurring reduces the average test uncertainty of RND, adding an orthogonal vector to the data increases the test uncertainty of RND.\nE VISUALIZATION OF DIFFERENT BLURRING TECHNIQUES\nFor visualization, we plot the CIFAR-10 images and their blurred versions processed by SVD-RND, DCT-RND, and GB-RND in Figure 5. Images in the same column are processed with the same technique. Furthermore, columns (b), (d), (e) are constructed by the best performing parameters of SVD-RND, DCT-RND, and GB-RND on SVHN OOD data. Likewise, (c), (e), (f) are constructed by the best performing parameters of SVD-RND, DCT-RND, and GB-RND on TinyImageNet OOD data. F EXTENDED VERSION OF FIGURE 3\nWe further extend Figure 3 to analyze the behavior of SVD-RND when a small number of singular values are discarded. Therefore, we experiment with SVD-RND where K1 = 5, 10, 15 and plot the result in Figure 6.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Exploiting semantic information and deep matching for optical flow , Authors: Min Bai Wenjie Luo... , 2016\n2.  Title: Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos , Authors: Vincent Casser Soeren Pirk... , 2019\n3.  Title: Depth map prediction from a single image using a multi-scale deep network , Authors: David Eigen Christian Puhrsch... , 2014\n4.  Title: Deep ordinal regression network for monocular depth estimation , Authors: Huan Fu Mingming Gong... , 2018\n5.  Title: Unsupervised cnn for single view depth estimation: Geometry to the rescue , Authors: Ravi Garg Vijay Kumar BG... , 2016\n6.  Title: Unsupervised monocular depth estimation with left-right consistency , Authors: Cl\u00e9ment Godard Oisin Mac Aodha... , 2017\n7.  Title: Digging into selfsupervised monocular depth prediction , Authors: Cl\u00e9ment Godard Oisin Mac Aodha... , 2018\n8.  Title: Packnet-sfm: 3d packing for self-supervised monocular depth estimation , Authors: Vitor Guizilini Sudeep Pillai... , 2019\n9.  Title: Spatial transformer networks. In Advances in neural information processing , Authors: Max Jaderberg Karen Simonyan... , 2017\n10.  Title: Single-image depth estimation based on fourier domain analysis , Authors: Jae-Han Lee Minhyeok Heo... , 2018\n11.  Title: Spigan: Privileged adversarial learning from simulation , Authors: Kuan-Hui Lee German Ros... , 2019\n12.  Title: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs , Authors: Bo Li Chunhua Shen... , 2015\n13.  Title: Learning to fuse things and stuff , Authors: Jie Li Allan Raventos... , 2018\n14.  
Title: Unsupervised learning of depth and egomotion from monocular video using 3d geometric constraints , Authors: Reza Mahjourian Martin Wicke... , 2018\n15.  Title: Geonet: Geometric neural network for joint depth and surface normal estimation , Authors: Xiaojuan Qi Renjie Liao... , 2018\n16.  Title: Demon: Depth and motion network for learning monocular stereo , Authors: Benjamin Ummenhofer Huizhong Zhou... , 2017\n17.  Title: Learning depth from monocular videos using direct methods , Authors: Chaoyang Wang Jos\u00e9 Miguel Buenaposada... , 2018\n18.  Title: Image quality assessment: from error visibility to structural similarity , Authors: Zhou Wang Alan C Bovik... , 2004\n19.  Title: Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry , Authors: Nan Yang Rui Wang... , 2018\n20.  Title: Unsupervised learning of geometry with edge-aware depth-normal consistency , Authors: Zhenheng Yang Peng Wang... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. 
Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.\u2020", "Since the seminal work of Eigen et al. (2014), substantial progress has been made to improve the accuracy of supervised depth estimation from monocular images, including the use of Conditional Random Fields (CRFs) (Li et al., 2015), joint optimization of surface normals (Qi et al., 2018), fusion of multiple depth maps (Lee et al., 2018), and ordinal classification (Fu et al., 2018). Consequently, as supervised techniques for depth estimation advanced rapidly, the availability of large-scale depth labels became a bottleneck, especially for outdoor applications. Garg et al. (2016) and Godard et al. (2017) provided an alternative self-supervised strategy involving stereo cameras, where Spatial Transformer Networks (Jaderberg et al., 2015) can be used to geometrically warp, in a differentiable way, the right image into a synthesized left image, using the predicted depth from the left image. The photometric consistency loss between the resulting synthesized and original left images can then be minimized in an end-to-end manner using a Structural Similarity term (Wang et al., 2004) and additional depth regularization terms. Following Godard et al. (2017) and Ummenhofer et al. (2017), Zhou et al. (2017) generalized this to the purely monocular setting, where a depth network and a pose network are simultaneously learned from unlabeled monocular videos. Rapid progress in terms of architectures and objective functions (Yin & Shi, 2018; Mahjourian et al., 2018; Casser et al., 2019; Zou et al., 2018; Klodt & Vedaldi, 2018; Wang et al., 2018; Yang et al., 2018) has since turned monocular depth estimation into one of the most successful applications of self-supervised learning, even outperforming supervised methods (Guizilini et al., 2019).
The introduction of semantic information to improve depth estimates has been explored in prior works, and can be broadly divided into two categories. The first one uses semantic (or instance) information to mask out or properly model dynamic portions of the image, which are not accounted for in the photometric loss calculation. G\u00fcney & Geiger (2015) leveraged object knowledge in a Markov Random Field (MRF) to resolve stereo ambiguities, while Bai et al.", "The paper proposes using pixel-adaptive convolutions to leverage semantic labels in self-supervised monocular depth estimation. The semantic features are predicted by a pretrained network rather than relying on ground truth. Moreover, a two-stage training process is proposed in order to filter out images leading to erroneous SfM predictions. The method is evaluated with different networks on the KITTI dataset.\n\nThe paper is very well written and clear. The application of per-pixel convolutions to this problem seems sound and the experimental validation seems satisfactory. I have however one main concern (1) and a few additional questions below:\n\n1) While (Guizilini 2019) shows that using a larger set of unannotated videos allows the self-supervised method to eventually outperform supervised methods, this study is not done here. This makes me question the applicability of the approach, as using large unlabelled videos would probably lead to noisy segmentations that could be unhelpful to the depth estimation. Showing an improvement over the supervised baseline would be a much stronger experimental validation, as for now it is difficult to know exactly in which scenario this method should be used, rather than a supervised network or vanilla packnet.\n\n2) I see that you obtain the same numbers in Table 2 / PackNet / row 1 as in (Guizilini 2019); I would like to confirm that you used exactly their self-objective loss, in all your experiments? I would suggest adding to section 3.1.
the fact that the loss is the same as in (Guizilini 2019), as a reader could assume that there is novelty in the loss formulation.\n\n3) Have you tried fine-tuning the whole architecture including the semantic network end-to-end?\n\n6: Weak Accept"], "input": "1 INTRODUCTION\nAccurate depth estimation is a key problem in computer vision and robotics, as it is instrumental for perception, navigation, and planning. Although perceiving depth typically requires dedicated sensors (e.g., stereo rigs, LiDAR), learning to predict depth from monocular imagery can provide useful cues for a wide array of tasks (Michels et al., 2005; Kendall et al., 2018; Manhardt et al., 2019; Lee et al., 2019). Going beyond supervised learning from direct measurements (Eigen et al., 2014), self-supervised methods exploit geometry as supervision (Guo et al., 2018; Pillai et al., 2019; Zou et al., 2018; Yang et al., 2017), therefore having the potential to leverage large-scale datasets of raw videos to outperform supervised methods (Guizilini et al., 2019).\nAlthough depth from a single image is an ill-posed inverse problem, monocular depth networks are able to make accurate predictions by learning representations connecting the appearance of scenes and objects with their geometry in Euclidean 3D space. Due to perspective, there is indeed an equivariance relationship between the visual appearance of an object in 2D and its depth, when conditioned on the object\u2019s category. For instance, a car 25 meters away appears smaller (on the image plane) than a car only 5 meters away but bigger than a truck 50 meters away.
Current depth estimation methods either do not leverage this structure explicitly or rely on strong semantic supervision to jointly optimize geometric consistency and a semantic proxy task in a multi-task objective (Ochs et al., 2019; Chen et al., 2019), thus departing from the self-supervised paradigm.\nIn this paper, we explore how we can leverage semantic information to improve monocular depth prediction in a self-supervised way. Our main contribution is a novel architecture that uses a fixed pre-trained semantic segmentation network to guide geometric representation learning in a self-supervised monocular depth network. In contrast to standard convolutional layers, our architecture uses pixel-adaptive convolutions (Su et al., 2019) to learn semantic-dependent representations that can better capture the aforementioned equivariance property. Leveraging semantics may nonetheless introduce category-specific biases. Our second contribution is a two-stage training process where we automatically detect the presence of a common bias on dynamic objects (projections at infinity) and resample the training set to de-bias it. Our method improves upon the state of the art in self-supervised monocular depth estimation on the standard KITTI benchmark (Geiger et al., 2013), both on average over pixels, over classes, and for dynamic categories in particular.\n\u2020Source code and pretrained models are available at https://github.com/TRI-ML/packnet-sfm\n3 SELF-SUPERVISED STRUCTURE-FROM-MOTION\nOur semantically-guided architecture is developed within a self-supervised monocular depth estimation setting, commonly known as structure-from-motion (SfM).
Learning in a self-supervised structure-from-motion setting requires two networks: a monocular depth model fD : I \u2192 D, that outputs a depth prediction D\u0302 = fD(I(p)) for every pixel p in the target image I; and a monocular ego-motion estimator fx : (It, IS) \u2192 xt\u2192S , that predicts the 6 DoF transformations for all s \u2208 S given by xt\u2192s = [R t; 0 1] \u2208 SE(3) between the target image It and a set of temporal context source images Is \u2208 IS . In all reported experiments we use It\u22121 and It+1 as source images. 3.1 THE SELF-SUPERVISED OBJECTIVE LOSS\nWe train the depth and pose networks simultaneously, using the same protocols and losses as described in Guizilini et al. (2019). Our self-supervised objective loss consists of an appearance matching term Lp that is imposed between the synthesized I\u0302t and original It target images, and a depth regularization term Ls that ensures edge-aware smoothing in the depth estimates D\u0302t. The final objective loss is averaged per pixel, pyramid scale and image batch, and is defined as:\nL(It, I\u0302t) = Lp(It, I\u0302t) + \u03bb1 Ls(D\u0302t) (1)\nwhere \u03bb1 is a weighting coefficient between the photometric Lp and depth smoothness Ls loss terms. Following Godard et al. (2017) and Zhou et al. (2017), the similarity between synthesized I\u0302t and original It target images is estimated using a Structural Similarity (SSIM) term (Wang et al., 2004) combined with an L1 loss term, inducing the following overall photometric loss:\nLp(It, I\u0302t) = \u03b1 (1 \u2212 SSIM(It, I\u0302t)) / 2 + (1 \u2212 \u03b1) \u2016It \u2212 I\u0302t\u2016 (2)\nIn order to regularize the depth in low gradient regions, we incorporate an edge-aware term similar to Godard et al. (2017).
This loss is weighted for each of the pyramid levels, decaying by a factor of 2 on each downsampling, starting with a weight of 1 for the 0th pyramid level.\nLs(D\u0302t) = |\u03b4xD\u0302t| exp(\u2212|\u03b4xIt|) + |\u03b4yD\u0302t| exp(\u2212|\u03b4yIt|) (3)\nWe also incorporate some of the insights introduced in Godard et al. (2018), namely auto-masking, minimum reprojection error, and inverse depth map upsampling to further improve depth estimation performance in our self-supervised monocular setting. 3.2 DEPTH AND POSE NETWORKS\nOur baseline depth and pose networks are based on the PackNet architecture introduced by Guizilini et al. (2019), which proposes novel packing and unpacking blocks to respectively downsample and upsample feature maps during the encoding and decoding stages. This network was selected due to its state-of-the-art performance in the task of self-supervised monocular depth estimation, so we can analyze if our proposed architecture is capable of further improving the current state-of-the-art. However, there are no restrictions as to which models our proposed semantically-guided architecture can be applied to, and in Section 5.4 we study its application to different depth networks. 4 SEMANTICALLY-GUIDED GEOMETRIC REPRESENTATION LEARNING\nIn this section, we describe our method to inject semantic information into a self-supervised depth network via its augmentation with semantic-aware convolutions. Our proposed architecture is depicted in Figure 2 and is composed of two networks: a primary one, responsible for the generation of depth predictions D\u0302 = fD(I(p)); and a secondary one, capable of producing semantic predictions. Only the first network is optimized during self-supervised learning; the semantic network is initialized from pretrained weights and is not further optimized.
This is in contrast to the common practice of supervised (ImageNet) pretraining of depth encoders (Godard et al., 2018; Casser et al., 2019; Zou et al., 2018): here instead of fine-tuning from pre-trained weights, we preserve these secondary weights to guide the feature learning process of the primary depth network. Our approach also differs from learning without forgetting (Li & Hoiem, 2017) by leveraging fixed intermediate feature representations as a way to maintain consistent semantic guidance throughout training. 4.1 SEMANTICALLY-GUIDED DEPTH FEATURES\nWe leverage the information from the pretrained semantic network in the depth network through the use of pixel-adaptive convolutions (Su et al., 2019). They were recently proposed to address some limitations inherent to the standard convolution operation, namely its translation invariance making it content-agnostic. While this significantly reduces the number of parameters of the resulting network, this might also lead to sub-optimal solutions under certain conditions important for geometric representation learning. For example, spatially-shared filters globally average the loss gradients over the entire image, forcing the network to learn weights that cannot leverage location-specific information beyond their limited receptive fields. Content-agnostic filters are unable to distinguish between different pixels that are visually similar (i.e. dark areas due to shadows or black objects) or generalize to similar objects that are visually different (i.e. cars with varying colors). In this work, we use pixel-adaptive convolutions to produce semantic-aware depth features, where the fixed information encoded in the semantic network is used to disambiguate geometric representations for the generation of multi-level depth features.\nAs shown in Figure 2, we extract multi-level feature maps from the semantic network. 
For each feature map, we apply a 3 \u00d7 3 and a 1 \u00d7 1 convolutional layer followed by Group Normalization (Wu & He, 2018) and ELU non-linearities (Clevert et al., 2016). These processed semantic feature maps are then used as guidance on their respective pixel-adaptive convolutional layers, following the formulation proposed in Su et al. (2019):\nv\u2032i = \u2211_{j\u2208\u2126(i)} K(fi, fj) W[pi \u2212 pj] vj + b (4)\nIn the above equation, f \u2208 R^D are processed features from the semantic network that will serve to guide the pixel-adaptive convolutions from the depth network, p = (x, y)^T are pixel coordinates, with [pi \u2212 pj] denoting 2D spatial offsets between pixels, W are k\u00d7k convolutional weights with kernel size k, \u2126(i) defines a k\u00d7k convolutional window around i, v is the input signal to be convolved, and b \u2208 R^1 is a bias term. K is the kernel used to calculate the correlation between guiding features, here chosen to be the standard Gaussian kernel:\nK(fi, fj) = exp(\u2212(1/2) (fi \u2212 fj)^T \u03a3ij^{\u22121} (fi \u2212 fj)) (5)\nwhere \u03a3ij is the covariance matrix between features fi and fj , here chosen to be a diagonal matrix \u03c3^2 \u00b7 I_D, with \u03c3 as an extra learnable parameter for each convolutional filter. These kernel evaluations can be seen as a secondary set of weights applied to the standard convolutional weights, changing their impact on the resulting depth features depending on the content stored in the guiding semantic features. For example, the information contained in depth features pertaining to the sky should not be used to generate depth features describing a pedestrian, and this behavior is now captured as a larger distance between their corresponding semantic features, which in turn produces smaller weights for that particular convolutional filter. Note that the standard convolution can be considered a special case of the pixel-adaptive convolution, where K(fi, fj) = 1 for all i, j.
4.2 SEMANTIC GUIDANCE NETWORK\nAs the secondary network used to provide semantic guidance for the generation of depth features, we use a Feature Pyramid Network (FPN) with ResNet backbone (Lin et al., 2017). This architecture has been shown to be efficient for both semantic and instance-level predictions towards panoptic segmentation (Kirillov et al., 2019; Li et al., 2018; Xiong et al., 2019; Porzi et al., 2019). While our proposed semantically-guided architecture is not restricted to any particular network, we chose this particular implementation to facilitate the future exploration of different sources for guidance information. Architectural details follow the protocols described in Li et al. (2018), and unless mentioned otherwise the same pretrained model was used in all reported experiments. The semantic network is assumed fixed, pretrained on a held out dataset different than the raw data used for selfsupervised learning, i.e. we do not require any semantic ground truth on the target dataset. 4.3 TWO-STAGE TRAINING\nOne well-known limitation of the self-supervised photometric loss is its inability to model dynamic objects, due to a static world assumption that only accounts for camera ego-motion (Godard et al., 2018; Casser et al., 2019). A resulting common failure mode is the infinite depth problem, which is caused by the presence of objects moving at the same speed as the camera. This typically causes distinct holes in the predicted depth maps, with arbitrarily large values where these objects should be. This severely hinders the applicability of such models in real-world applications, particularly for automated driving, where the ability to detect and properly model dynamic objects is crucial. Moreover, this limitation may be further accentuated in our proposed semantically-guided architecture, as the infinite depth problem occurs mostly on dynamic categories (i.e. 
cars and motorcycles) and the semantic-aware features may reinforce this bias.\nWe propose a simple and efficient two-stage training method to detect and remove this bias from the training set. In the first stage, we learn a standard depth network on all available training data. This network, exhibiting the infinite depth problem, is then used to resample the dataset by automatically filtering out sequences with infinite depth predictions that violate a basic geometric prior. We indeed find that depth predictions for pixels corresponding to the nearby ground plane are generally robust. This enables getting a coarse estimate of the ground plane using RANSAC and detecting the number of pixels whose predicted depth projects them significantly below the ground. If that number is above a threshold, then the corresponding image is subsequently ignored (we found a conservative threshold of 10 to work well in all our experiments, filtering out roughly 5% of the KITTI training dataset). During the second stage, we retrain the network on the subsampled dataset (from scratch to avoid the previous local optimum). As this subsampled dataset is de-biased, the network learns better depth estimates on dynamic objects. This process can be repeated, but we find that two stages are enough to remove any traces of infinite depth in our experiments, as shown in Figure 3. 5 EXPERIMENTAL RESULTS\n 5.1 DATASETS\nWe use the standard KITTI benchmark (Geiger et al., 2013) for self-supervised training and evaluation. More specifically, we adopt the training, validation and test splits used in Eigen et al. (2014) with the pre-processing from Zhou et al. (2017) to remove static frames, which is more suitable for\nmonocular self-supervised learning. This results in 39810 images for training, 4424 for validation, and 697 for evaluation. Following common practice, we pretrain our depth and pose networks on the CityScapes dataset (Cordts et al., 2016), consisting of 88250 unlabeled images. 
Unless noted otherwise, input images are downsampled to 640 x 192 resolution and output inverse depth maps are upsampled to full resolution using bilinear interpolation. Our fixed semantic segmentation network is pretrained on Cityscapes, achieving a mIoU of 75% on the validation set. 5.2 IMPLEMENTATION DETAILS\nWe implement our models with PyTorch (Paszke et al., 2017) and follow the same training protocols of Guizilini et al. (2019) when optimizing our depth and pose networks. The initial training stage is conducted on the CityScapes dataset for 50 epochs, with a batch size of 4 per GPU and initial depth and pose learning rates of 2 \u00b7 10\u22124 and 5 \u00b7 10\u22124 respectively, that are halved every 20 epochs. Afterwards, the depth and pose networks are fine-tuned on KITTI for 30 epochs, with the same parameters and halving the learning rates after every 12 epochs. This fine-tuning stage includes the proposed architecture, where information from the fixed semantic network, pretrained separately, is used to directly guide the generation of depth features. There is no direct supervision at any stage during depth training, all semantic information is derived from the fixed secondary network.\nWhen pretraining the semantic segmentation network, we use a ResNet-50 backbone with Imagenet (Deng et al., 2009) pretrained weights and optimize the network for 48k iterations on the CityScapes dataset with a learning rate of 0.01, momentum of 0.9, weight decay of 10\u22124, and a batch size of 1 per GPU. Random scaling between (0.7, 1.3), random horizontal flipping, and a crop size of 1000\u00d72000 are used for data augmentation. We decay the learning rate by a factor of 10 at iterations 36k and 44k. Once training is complete, the semantic segmentation network is fixed and becomes the only source of semantic information when fine-tuning the depth and pose networks on KITTI. 
5.3 DEPTH ESTIMATION PERFORMANCE\nOur depth estimation results are summarized in Table 1, where we compare our proposed architecture with other published works. From these results we can see that the introduction of semantically-guided geometric representation learning further improves upon the current state of the art in self-supervised monocular depth estimation from Guizilini et al. (2019), which served as our baseline. Our approach also outperforms other methods that leverage semantic information by a substantial margin, even those using ground-truth KITTI semantic segmentation and depth labels during training (Ochs et al., 2019). Furthermore, in Figure 5 we also present qualitative results showing the improvements in depth estimation generated by our proposed framework, compared to our baseline. Note how our semantically-guided architecture produces sharper boundaries and better object delineation, especially in structures further away or not clearly distinguishable in the input image. 5.4 ABLATIVE ANALYSIS\n 5.4.1 DIFFERENT DEPTH NETWORKS\nTo better evaluate our main contribution, we provide an ablative analysis showing how it generalizes to different depth networks. To this end, we consider two variations of the widely used ResNet architecture as the encoder for our depth network: ResNet-18 and ResNet-50 (the same pretrained semantic network was used in all experiments). Depth estimation results considering these variations are shown in Table 2, where we can see that our proposed semantically-guided architecture is able to consistently improve the performance of different depth networks, for all considered metrics. 5.4.2 CLASS-SPECIFIC DEPTH PERFORMANCE\nTo further showcase the benefits of our semantically-guided architecture, we also provide class-specific evaluation metrics, as shown in Figure 4.
As we do not have ground-truth semantic segmentation for these images, we use the prediction of the semantic network to bin pixels per predicted category, and evaluate only on those pixels. From these results we can see that our proposed architecture consistently improves depth performance for pixels across all predicted classes, especially those containing fine-grained structures and sharp boundaries, e.g. poles and traffic signs.\nWe also measure the impact of our two-stage training process, which is expected to address the infinite depth problem in dynamic objects. Although we find the pixel-average difference in performance to not be significant (see Table 2), there is a significant improvement in class-average depth estimation, from 0.121 to 0.117 Abs-Rel. This is because the number of pixels affected by the infinite depth problem is vastly smaller than the total number of pixels. However, when considering class-average depth evaluation, the improvement over classes such as cars (0.200 to 0.177 Abs-Rel) and motorcycles (0.091 to 0.069) becomes statistically significant. This further exemplifies the importance of fine-grained metrics in depth evaluation, so these underlying behaviors can be properly observed and accounted for in the development of new techniques. 6 CONCLUSION\nThis paper introduces a novel architecture for self-supervised monocular depth estimation that leverages semantic information from a fixed pretrained network to guide the generation of multi-level depth features via pixel-adaptive convolutions. Our monodepth network learns semantic-aware geometric representations that can disambiguate photometric ambiguities in a self-supervised learning structure-from-motion context. Furthermore, we introduce a two-stage training process that resamples training data to overcome a common bias on dynamic objects resulting in predicting them at infinite depths. 
Our experiments on challenging real-world data show that our proposed architecture consistently improves the performance of different monodepth architectures, thus establishing a new state of the art in self-supervised monocular depth estimation. Future directions of research include leveraging other sources of guidance (e.g. instance masks, optical flow, surface normals), as well as avenues for self-supervised fine-tuning of the semantic network. A PRE-TRAINING THE SEMANTIC SEGMENTATION NETWORK\nThe introduction of a semantic segmentation network to the depth estimation task increases the depth estimation performance, but it also increases model complexity (e.g. number of trainable parameters). To verify that the increased performance for the depth estimation task is indeed due to the semantic features encoded in the secondary network, we perform an in-depth analysis (summarized in Table 3) where we explore the impact of pre-training the semantic segmentation network before it is used to guide the generation of depth features. From these results we can see that the presence of semantic information encoded in the secondary network indeed leads to an increase in performance, and that fine-tuning this secondary network for the specific task of depth estimation actually decreases performance.\nIn the first two rows an untrained semantic network is utilized, with only its encoder initialized from ImageNet (Deng et al., 2009) weights. Two different scenarios are explored: in the first one (D) only the depth network is fine-tuned in a self-supervised fashion, while in D+S both networks are fine-tuned together in the same way. As expected, using untrained features as guidance leads to significantly worse results, since there is no structure encoded in the secondary network and the primary network needs to learn to filter out all this spurious information.
When both networks are fine-tuned simultaneously, results improve because now the added complexity from the secondary network can be leveraged for the task of depth estimation, however there is still no improvement over the baseline.\nNext, the semantic network was pre-trained on only half of the CityScapes (Cordts et al., 2016) dataset (samples chosen randomly), leading to a worse semantic segmentation performance (validation mIoU of around 70% vs. 75% for the fully trained one). This partial pre-training stage was enough to enable the transfer of useful information between networks, leading to improvements over the baseline. Interestingly, fine-tuning both networks for the task of depth estimation actually hurt performance this time, which we attribute to forgetting the information contained in the secondary network, as both networks are optimized for the depth task. When the semantic network is pretrained with all of CityScapes (last two rows), these effects are magnified, with fine-tuning only the depth network leading to our best reported performance (Table 1) and fine-tuning both networks again leading to results similar to the baseline. B UNCERTAINTY AND GENERALIZATION TO DIFFERENT OBJECTS\nIn a self-supervised setting, increasing the number of unlabeled videos used for depth training is expected to lead to an increasing specialization away from the domain in which the semantic network was pre-trained. This might result in harmful guidance if our method is not robust to this gap. However, our approach does not use semantic predictions directly, but rather the decoded features of the semantic network themselves, which represent general appearance information that should be more robust to this domain gap. To validate our hypothesis, we further explore the impact\nof erroneous semantic information in the performance of our proposed semantically-guided depth framework. 
In Figure 6 we present qualitative results highlighting situations in which our pretrained semantic network failed to generate correct semantic predictions for certain objects in the scene, and yet our proposed framework was still able to properly recover depth values for that portion of the environment. These exemplify possible scenarios for erroneous semantic prediction.\n\u2022 Imprecise boundaries: in the first row, we can see that the semantic segmentation network does not correctly detect the traffic sign, yet the semantically-guided depth network predicts its shape and depth accurately.\n\u2022 Wrong classification: in the second row, the truck was mistakenly classified as partially \u201croad\u201d and \u201cbuilding\u201d, however our semantically-guided depth network was still able to properly recover its overall shape with sharp delineation that was not available from its semantic contour. A similar scenario happens in the same image, with \u201cfence\u201d being partially labeled as \u201cbicycle\u201d.\n\u2022 Missing ontology: there is no \u201ctrash can\u201d class on the CityScapes ontology, however in the third row our semantically-guided depth network was able to correctly reconstruct such object even though it was classified as \u201cfence\u201d, similarly to its surroundings.\n\u2022 Object Hallucination: in the fourth row, the contour of a \u201cperson\u201d was erroneously introduced in the image and correctly removed by our semantically-guided framework.\nThese examples are evidence that our proposed framework is able to reason over the uncertainty inherent to semantic classification, leveraging this information when accurate to achieve the results reported in this paper, but also discarding it if necessary to generate a better reconstruction according to the self-supervised photometric loss. 
C GENERALIZATION TO DIFFERENT DATASETS\nIn the previous sections, we show that our proposed framework is robust to a degraded semantic network, both by pretraining the semantic network with fewer annotated labels (Appendix A) and also by providing evidence that the depth network is able to reason over erroneous predictions to still generate accurate reconstructions (Appendix B). We now go one step further and analyze how our proposed semantically-guided framework generalizes to a dataset that was used neither during pre-training nor for fine-tuning. To this end, we evaluate our KITTI depth model on the recently released NuScenes dataset (Caesar et al., 2019). The official NuScenes validation split is used, containing 6019 images from the front camera with ground-truth depth maps generated by LiDAR reprojection. Results presented in Table 4 provide additional evidence that our method indeed results in generalization improvements, even on significantly different data from different platforms\nand environments (Karlsruhe, Germany for KITTI vs Boston, USA and Singapore for NuScenes), outperforming the state of the art methods and our baseline (Guizilini et al., 2019).", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Neural networks and principal component analysis: Learning from examples without local minima , Authors: Pierre Baldi Kurt Hornik , 1989\n2.  Title: The loss surfaces of multilayer networks , Authors: Anna Choromanska Mikael Henaff... , 2015\n3.  Title: On the power of over-parametrization in neural networks with quadratic activation , Authors: Simon S Du Jason D Lee , 2018\n4.  Title: Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced , Authors: Simon S Du Wei Hu... , 2018\n5.  Title: When is a convolutional filter easy to learn , Authors: Simon S Du Jason D Lee... , 2018\n6.  Title: Gradient descent learns onehidden-layer cnn: Dont be afraid of spurious local minima , Authors: Simon S Du Jason D Lee... , 2018\n7.  Title: Escaping from saddle pointsonline stochastic gradient for tensor decomposition , Authors: Rong Ge Furong Huang... , 2015\n8.  Title: Matrix completion has no spurious local minimum , Authors: Rong Ge Jason D Lee... , 2016\n9.  Title: Global optimality in neural network training , Authors: Benjamin Haeffele Ren\u00e9 Vidal , 2017\n10.  Title: Identity matters in deep learning , Authors: Moritz Hardt Tengyu Ma , 2016\n11.  Title: Deep learning without poor local minima , Authors: Kenji Kawaguchi , 2016\n12.  Title: Deep linear networks with arbitrary loss: All local minima are global , Authors: Thomas Laurent James Brecht , 2018\n13.  Title: Gradient descent only converges to minimizers , Authors: Jason D Lee Max Simchowitz... , 2016\n14.  Title: The loss surface of deep and wide neural networks , Authors: Quynh Nguyen Matthias Hein , 2017\n15.  Title: The loss surface and expressivity of deep convolutional neural networks , Authors: Quynh Nguyen Matthias Hein , 2018\n16.  
Title: Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions , Authors: Ioannis Panageas Georgios Piliouras , 2017\n17.  Title: Spurious local minima are common in two-layer relu neural networks , Authors: Itay Safran Ohad Shamir , 2018\n18.  Title: Exponential convergence time of gradient descent for one-dimensional deep linear neural networks , Authors: Ohad Shamir , 2018\n19.  Title: No bad local minima: Data independent training error guarantees for multilayer neural networks , Authors: Daniel Soudry Yair Carmon , 2016\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as x 7\u2192 WNWN\u22121 \u00b7 \u00b7 \u00b7W1x) by minimizing the `2 loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).", "Theoretical study of gradient-based optimization in deep learning is a highly active area of research. As discussed in Section 1, a popular approach is to show that the objective landscape admits the properties of no poor local minima and strict saddle, which, by Ge et al. 
(2015); Lee et al. (2016); Panageas and Piliouras (2017), ensure convergence to global minimum. Many works, both classic (e.g. Baldi and Hornik (1989)) and recent (e.g. Choromanska et al. (2015); Kawaguchi (2016); Hardt and Ma (2016); Soudry and Carmon (2016); Haeffele and Vidal (2017); Nguyen and Hein (2017); Safran and Shamir (2018); Nguyen and Hein (2018); Laurent and Brecht (2018)), have focused on the validity of these properties in different deep learning settings. Nonetheless, to our knowledge, the success of landscape-driven analyses in formally proving convergence to global minimum for a gradient-based algorithm has thus far been limited to shallow (two layer) models only (e.g. Ge et al. (2016); Du and Lee (2018); Du et al. (2018a)). An alternative to the landscape approach is a direct analysis of the trajectories taken by the optimizer.\n9https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/mnist\n10As opposed to the dataset used in our experiments with linear networks, measuring the training loss with MNIST is non-trivial computationally (involves passing through 60K examples). Therefore, rather than continuously polling training loss until it reaches a certain threshold, in this experiment we chose to evaluate speed of convergence by measuring the training loss once after a predetermined number of iterations.", "This paper studies the convergence of gradient descent on the squared loss of deep linear neural networks. The authors prove linear convergence rate if (1) the network dimensions are big enough so that the full product can have full rank, (2) the singular values of each weight matrix are approximately the same, (3) the initialized point is \u201cclose enough\u201d to the target.\n\nMy main concern about the study of GD on linear neural network is whether we really get any \u201cbenefit\u201d or \u201cacceleration\u201d from depth, i.e., is GD on linear neural nets any faster than GD on linear models. 
It\u2019s been shown that we get acceleration in some cases (e.g., $\\ell_p$ regression when $p>2$ [Arora et al. 18\u2019]), but some other results (e.g., [Shamir 18\u2019] mentioned in Section 5) show that GD on linear neural nets (when weight matrices are all scalar) suffer exponential (in depth) increase in convergence time at near zero region, due to the vanishing gradient phenomenon. From my understanding, this paper circumvents this problem by assuming deficiency margin, because in the setting of [Shamir 18\u2019], deficiency margin means that the initialized product ($W_{1:N}$) has the same sign as $\\Phi$ and far enough from zero, so we don\u2019t have to pass through the near-zero region.\n\nBelow, I\u2019ll list specific concerns/questions/comments.\n* In my opinion, the statements about \u201cnecessity\u201d of two key assumptions are too strong, because the authors only provide counterexamples of non-convergence. As [Theorem 3, Shamir 18\u2019] shows (although in scalar case), even when the assumptions are not satisfied, a convergence rate $O(exp(N) * log(1/\\epsilon))$ is possible. It will be an interesting future work to clearly delineate the boundary between convergence and non-convergence.\n\n* In Thm 2 and Claim 3, what happens if dimension $d_0$ is smaller? What is the reason that you had to restrict it to high dimension? Is it due to high variance with few samples?\n\n* In Thm 2, constants $d\u2019_0$ and $a$ hide the dependence of the result on p, but I would suggest stating the dependence of those parameters on p, and also dependence on other parameters such as N.\n\n* In Section 5, there is a statement \u201cThis negative result, a theoretical manifestation of the \u201cvanishing gradient problem\u201d, is circumvented by balanced initialization.\u201d Can you elaborate more on that? 
If my understanding is correct, there is still $\\sigma_min$ multiplier in Eq (9), which means that at near-zero regions, the gradient will still vanish.\n\n7: Good paper, accept"], "input": "1 INTRODUCTION\nDeep learning builds upon the mysterious ability of gradient-based optimization methods to solve related non-convex problems. Immense efforts are underway to mathematically analyze this phenomenon. The prominent landscape approach focuses on special properties of critical points (i.e. points where the gradient of the objective function vanishes) that will imply convergence to global optimum. Several papers (e.g. Ge et al. (2015); Lee et al. (2016)) have shown that (given certain smoothness properties) it suffices for critical points to meet the following two conditions: (i) no poor local minima \u2014 every local minimum is close in its objective value to a global minimum; and (ii) strict saddle property \u2014 every critical point that is not a local minimum has at least one negative eigenvalue to its Hessian. While condition (i) does not always hold (cf. Safran and Shamir (2018)), it has been established for various simple settings (e.g. Soudry and Carmon (2016); Kawaguchi (2016)). Condition (ii) on the other hand seems less plausible, and is in fact provably false for models with three or more layers (cf. Kawaguchi (2016)), i.e. for deep networks. It has only been established for problems involving shallow (two layer) models, e.g. matrix factorization (Ge et al. (2016); Du et al. (2018a)). The landscape approach as currently construed thus suffers from inherent limitations in proving convergence to global minimum for deep networks.\nA potential path to circumvent this obstacle lies in realizing that landscape properties matter only in the vicinity of trajectories that can be taken by the optimizer, which may be a negligible portion of the overall parameter space. Several papers (e.g. Saxe et al. (2014); Arora et al. 
(2018)) have taken this trajectory-based approach, primarily in the context of linear neural networks \u2014 fully-connected neural networks with linear activation. Linear networks are trivial from a representational perspective, but not so in terms of optimization \u2014 they lead to non-convex training problems with multiple minima and saddle points. Through a mix of theory and experiments, Arora et al. (2018) argued that such non-convexities may in fact be beneficial for gradient descent, in the sense that sometimes, adding (redundant) linear layers to a classic linear prediction model can accelerate the optimization. This phenomenon challenges the holistic landscape view, by which convex problems are always preferable to non-convex ones.\nEven in the linear network setting, a rigorous proof of efficient convergence to global minimum has proved elusive. One recent advance is the analysis of Bartlett et al. (2018) for linear residual networks \u2014 a particular subclass of linear neural networks in which the input, output and all hidden dimensions are equal, and all layers are initialized to be the identity matrix (cf. Hardt and Ma (2016)). Through a trajectory-based analysis of gradient descent minimizing $\\ell_2$ loss over a whitened dataset (see Section 2), Bartlett et al. (2018) show that convergence to global minimum at a linear rate \u2014 loss is less than $\\epsilon > 0$ after $O(\\log \\frac{1}{\\epsilon})$ iterations \u2014 takes place if one of the following holds: (i) the objective value at initialization is sufficiently close to a global minimum; or (ii) a global minimum is attained when the product of all layers is positive definite.\nThe current paper carries out a trajectory-based analysis of gradient descent for general deep linear neural networks, covering the residual setting of Bartlett et al. (2018), as well as many more settings that better match practical deep learning. Our analysis draws upon the trajectory characterization of Arora et al. 
(2018) for gradient flow (infinitesimally small learning rate), together with significant new ideas necessitated by discrete updates. Ultimately, we show that when minimizing $\\ell_2$ loss of a deep linear network over a whitened dataset, gradient descent converges to the global minimum, at a linear rate, provided that the following conditions hold: (i) the dimensions of hidden layers are greater than or equal to the minimum between those of the input and output; (ii) layers are initialized to be approximately balanced (see Definition 1) \u2014 this is met under commonplace near-zero, as well as residual (identity) initializations; and (iii) the initial loss is smaller than any loss obtainable with rank deficiencies \u2014 this condition will hold with probability close to 0.5 if the output dimension is 1 (scalar regression) and standard (random) near-zero initialization is employed. Our result applies to networks with arbitrary depth and input/output dimensions, as well as any configuration of hidden layer widths that does not force rank deficiency (i.e. that meets condition (i)). The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the case of scalar regression, they are met with constant probability under a random initialization scheme. We are not aware of any similarly general analysis for efficient convergence of gradient descent to global minimum in deep learning.\nThe remainder of the paper is organized as follows. In Section 2 we present the problem of gradient descent training a deep linear neural network by minimizing the $\\ell_2$ loss over a whitened dataset. Section 3 formally states our assumptions, and presents our convergence analysis. Key ideas brought forth by our analysis are demonstrated empirically in Section 4. Section 5 gives a review of relevant literature, including a detailed comparison of our results against those of Bartlett et al. 
(2018). Finally, Section 6 concludes.\n2 GRADIENT DESCENT FOR DEEP LINEAR NEURAL NETWORKS\nWe denote by $\\|v\\|$ the Euclidean norm of a vector $v$, and by $\\|A\\|_F$ the Frobenius norm of a matrix $A$.\nWe are given a training set $\\{(x^{(i)}, y^{(i)})\\}_{i=1}^m \\subset \\mathbb{R}^{d_x} \\times \\mathbb{R}^{d_y}$, and would like to learn a hypothesis (predictor) from a parametric family $\\mathcal{H} := \\{h_\\theta : \\mathbb{R}^{d_x} \\to \\mathbb{R}^{d_y} \\mid \\theta \\in \\Theta\\}$ by minimizing the $\\ell_2$ loss:$^1$\n$$\\min_{\\theta \\in \\Theta} L(\\theta) := \\frac{1}{2m} \\sum_{i=1}^m \\|h_\\theta(x^{(i)}) - y^{(i)}\\|^2 .$$\nWhen the parametric family in question is the class of linear predictors, i.e. $\\mathcal{H} = \\{x \\mapsto Wx \\mid W \\in \\mathbb{R}^{d_y \\times d_x}\\}$, the training loss may be written as $L(W) = \\frac{1}{2m} \\|WX - Y\\|_F^2$, where $X \\in \\mathbb{R}^{d_x \\times m}$ and $Y \\in \\mathbb{R}^{d_y \\times m}$ are matrices whose columns hold instances and labels respectively. Suppose now that the dataset is whitened, i.e. has been transformed such that the empirical (uncentered) covariance matrix for instances \u2014 $\\Lambda_{xx} := \\frac{1}{m} X X^\\top \\in \\mathbb{R}^{d_x \\times d_x}$ \u2014 is equal to identity. Standard calculations (see Appendix A) show that in this case:\n$$L(W) = \\frac{1}{2} \\|W - \\Lambda_{yx}\\|_F^2 + c , \\qquad (1)$$\nwhere $\\Lambda_{yx} := \\frac{1}{m} Y X^\\top \\in \\mathbb{R}^{d_y \\times d_x}$ is the empirical (uncentered) cross-covariance matrix between instances and labels, and $c$ is a constant (that does not depend on $W$).\n$^1$Much of the analysis in this paper can be extended to loss types other than $\\ell_2$. In particular, the notion of deficiency margin (Definition 2) can be generalized to account for any convex loss, and, so long as the loss is differentiable, a convergence result analogous to Theorem 1 will hold in the idealized setting of perfect initial balancedness and infinitesimally small learning rate (see proof of Lemma 1). We leave to future work treatment of approximate balancedness and discrete updates in this general setting.\n
Denoting $\\Phi := \\Lambda_{yx}$ for brevity, we have that for linear models, minimizing $\\ell_2$ loss over whitened data is equivalent to minimizing the squared Frobenius distance from a target matrix $\\Phi$:\n$$\\min_{W \\in \\mathbb{R}^{d_y \\times d_x}} L^1(W) := \\frac{1}{2} \\|W - \\Phi\\|_F^2 . \\qquad (2)$$\nOur interest in this work lies on linear neural networks \u2014 fully-connected neural networks with linear activation. A depth-$N$ ($N \\in \\mathbb{N}$) linear neural network with hidden widths $d_1, \\ldots, d_{N-1} \\in \\mathbb{N}$ corresponds to the parametric family of hypotheses $\\mathcal{H} := \\{x \\mapsto W_N W_{N-1} \\cdots W_1 x \\mid W_j \\in \\mathbb{R}^{d_j \\times d_{j-1}}, \\; j = 1, \\ldots, N\\}$, where $d_0 := d_x$, $d_N := d_y$. Similarly to the case of a (directly parameterized) linear predictor (Equation (2)), with a linear neural network, minimizing $\\ell_2$ loss over whitened data can be cast as squared Frobenius approximation of a target matrix $\\Phi$:\n$$\\min_{W_j \\in \\mathbb{R}^{d_j \\times d_{j-1}}, \\; j = 1, \\ldots, N} L^N(W_1, \\ldots, W_N) := \\frac{1}{2} \\|W_N W_{N-1} \\cdots W_1 - \\Phi\\|_F^2 . \\qquad (3)$$\nNote that the notation $L^N(\\cdot)$ is consistent with that of Equation (2), as a network with depth $N = 1$ precisely reduces to a (directly parameterized) linear model.\nWe focus on studying the process of training a deep linear neural network by gradient descent, i.e. of tackling the optimization problem in Equation (3) by iteratively applying the following updates:\n$$W_j(t+1) \\leftarrow W_j(t) - \\eta \\frac{\\partial L^N}{\\partial W_j}\\big( W_1(t), \\ldots, W_N(t) \\big) , \\quad j = 1, \\ldots, N , \\quad t = 0, 1, 2, \\ldots , \\qquad (4)$$\nwhere $\\eta > 0$ is a configurable learning rate. In the case of depth $N = 1$, the training problem in Equation (3) is smooth and strongly convex, thus it is known (cf. Boyd and Vandenberghe (2004)) that with proper choice of $\\eta$, gradient descent converges to global minimum at a linear rate. 
In contrast, for any depth greater than 1, Equation (3) comprises a fundamentally non-convex program, and the convergence properties of gradient descent are highly non-trivial. Apart from the caseN = 2 (shallow network), one cannot hope to prove convergence via landscape arguments, as the strict saddle property is provably violated (see Section 1). We will see in Section 3 that a direct analysis of the trajectories taken by gradient descent can succeed in this arena, providing a guarantee for linear rate convergence to global minimum.\nWe close this section by introducing additional notation that will be used in our analysis. For an arbitrary matrix A, we denote by \u03c3max(A) and \u03c3min(A) its largest and smallest (respectively) singular values.2 For d \u2208 N, we use Id to signify the identity matrix in Rd\u00d7d. Given weights W1, . . . ,WN of a linear neural network, we let W1:N be the direct parameterization of the end-to-end linear mapping realized by the network, i.e. W1:N := WNWN\u22121 \u00b7 \u00b7 \u00b7W1. Note that LN (W1, . . . ,WN ) = L\n1(W1:N ), meaning the loss associated with a depth-N network is equal to the loss of the corresponding end-to-end linear model. In the context of gradient descent, we will oftentimes use `(t) as shorthand for the loss at iteration t:\n`(t) := LN (W1(t), . . . ,WN (t)) = L 1(W1:N (t)) . (5) 3 CONVERGENCE ANALYSIS\nIn this section we establish convergence of gradient descent for deep linear neural networks (Equations (4) and (3)) by directly analyzing the trajectories taken by the algorithm. We begin in Subsection 3.1 with a presentation of two concepts central to our analysis: approximate balancedness and deficiency margin. These facilitate our main convergence theorem, delivered in Subsection 3.2. We conclude in Subsection 3.3 by deriving a convergence guarantee that holds with constant probability over a random initialization. 
3.1 APPROXIMATE BALANCEDNESS AND DEFICIENCY MARGIN\nIn our context, the notion of approximate balancedness is formally defined as follows:\n2If A \u2208 Rd\u00d7d \u2032 , \u03c3min(A) stands for the min{d, d\u2032}-th largest singular value. Recall that singular values\nare always non-negative.\nDefinition 1. For \u03b4 \u2265 0, we say that the matrices Wj \u2208 Rdj\u00d7dj\u22121 , j=1, . . . , N , are \u03b4-balanced if:\u2225\u2225W>j+1Wj+1 \u2212WjW>j \u2225\u2225F \u2264 \u03b4 , \u2200j \u2208 {1, . . . , N \u2212 1} . Note that in the case of 0-balancedness, i.e. W>j+1Wj+1 = WjW > j , \u2200j \u2208 {1, . . . , N \u2212 1}, all matrices Wj share the same set of non-zero singular values. Moreover, as shown in the proof of Theorem 1 in Arora et al. (2018), this set is obtained by taking the N -th root of each non-zero singular value in the end-to-end matrix W1:N . We will establish approximate versions of these facts for \u03b4-balancedness with \u03b4 > 0, and admit their usage by showing that if the weights of a linear neural network are initialized to be approximately balanced, they will remain that way throughout the iterations of gradient descent. The condition of approximate balancedness at initialization is trivially met in the special case of linear residual networks (d0 = \u00b7 \u00b7 \u00b7 = dN = d and W1(0) = \u00b7 \u00b7 \u00b7 = WN (0) = Id). Moreover, as Claim 2 in Appendix B shows, for a given \u03b4 > 0, the customary initialization via random Gaussian distribution with mean zero leads to approximate balancedness with high probability if the standard deviation is sufficiently small.\nThe second concept we introduce \u2014 deficiency margin \u2014 refers to how far a ball around the target is from containing rank-deficient (i.e. low rank) matrices.\nDefinition 2. 
Given a target matrix \u03a6 \u2208 RdN\u00d7d0 and a constant c > 0, we say that a matrix W \u2208 RdN\u00d7d0 has deficiency margin c with respect to \u03a6 if:3\n\u2016W \u2212 \u03a6\u2016F \u2264 \u03c3min(\u03a6)\u2212 c . (6)\nThe term \u201cdeficiency margin\u201d alludes to the fact that if Equation (6) holds, every matrix W \u2032 whose distance from \u03a6 is no greater than that of W , has singular values c-bounded away from zero:\nClaim 1. Suppose W has deficiency margin c with respect to \u03a6. Then, any matrix W \u2032 (of same size as \u03a6 and W ) for which \u2016W \u2032 \u2212 \u03a6\u2016F \u2264 \u2016W \u2212 \u03a6\u2016F satisfies \u03c3min(W \u2032) \u2265 c.\nProof. Our proof relies on the inequality \u03c3min(A+B) \u2265 \u03c3min(A)\u2212\u03c3max(B) \u2014 see Appendix D.1.\nWe will show that if the weights W1, . . . ,WN are initialized such that (they are approximately balanced and) the end-to-end matrix W1:N has deficiency margin c > 0 with respect to the target \u03a6, convergence of gradient descent to global minimum is guaranteed.4 Moreover, the convergence will outpace a particular rate that gets faster when c grows larger. This suggests that from a theoretical perspective, it is advantageous to initialize a linear neural network such that the end-to-end matrix has a large deficiency margin with respect to the target. Claim 3 in Appendix B provides information on how likely deficiency margins are in the case of a single output model (scalar regression) subject to customary zero-centered Gaussian initialization. 
It shows in particular that if the standard deviation of the initialization is sufficiently small, the probability of a deficiency margin being met is close to 0.5; on the other hand, for this deficiency margin to have considerable magnitude, a non-negligible standard deviation is required.\nTaking into account the need for both approximate balancedness and deficiency margin at initialization, we observe a delicate trade-off under the common setting of Gaussian perturbations around zero: if the standard deviation is small, it is likely that weights be highly balanced and a deficiency margin be met; however overly small standard deviation will render high magnitude for the deficiency margin improbable, and therefore fast convergence is less likely to happen; on the opposite end, large standard deviation jeopardizes both balancedness and deficiency margin, putting the entire convergence at risk. This trade-off is reminiscent of empirical phenomena in deep learning, by\n3Note that deficiency margin c > 0 with respect to \u03a6 implies \u03c3min(\u03a6) > 0, i.e. \u03a6 has full rank. Our analysis can be extended to account for rank-deficient \u03a6 by replacing \u03c3min(\u03a6) in Equation (6) with the smallest positive singular value of \u03a6, and by requiring that the end-to-end matrix W1:N be initialized such that its left and right null spaces coincide with those of \u03a6. Relaxation of this requirement is a direction for future work.\n4In fact, a deficiency margin implies that all critical points in the respective sublevel set (set of points with smaller loss value) are global minima. This however is far from sufficient for proving convergence, as sublevel sets are unbounded, and the loss landscape over them is non-convex and non-smooth. 
Indeed, we show in Appendix C that deficiency margin alone is not enough to ensure convergence \u2014 without approximate balancedness, the lack of smoothness can cause divergence.\nwhich small initialization can bring forth efficient convergence, while if exceedingly small, rate of convergence may plummet (\u201cvanishing gradient problem\u201d), and if made large, divergence becomes inevitable (\u201cexploding gradient problem\u201d). The common resolution of residual connections (He et al., 2016) is analogous in our context to linear residual networks, which ensure perfect balancedness, and allow large deficiency margin if the target is not too far from identity.\n3.2 MAIN THEOREM\nUsing approximate balancedness (Definition 1) and deficiency margin (Definition 2), we present our main theorem \u2014 a guarantee for linear convergence to global minimum:\nTheorem 1. Assume that gradient descent is initialized such that the end-to-end matrix $W_{1:N}(0)$ has deficiency margin $c > 0$ with respect to the target $\\Phi$, and the weights $W_1(0), \\ldots, W_N(0)$ are $\\delta$-balanced with $\\delta = c^2 / \\big( 256 \\cdot N^3 \\cdot \\|\\Phi\\|_F^{2(N-1)/N} \\big)$. Suppose also that the learning rate $\\eta$ meets:\n$$\\eta \\le \\frac{c^{(4N-2)/N}}{6144 \\cdot N^3 \\cdot \\|\\Phi\\|_F^{(6N-4)/N}} . \\qquad (7)$$\nThen, for any $\\epsilon > 0$ and:\n$$T \\ge \\frac{1}{\\eta \\cdot c^{2(N-1)/N}} \\cdot \\log\\left( \\frac{\\ell(0)}{\\epsilon} \\right) , \\qquad (8)$$\nthe loss at iteration $T$ of gradient descent \u2014 $\\ell(T)$ \u2014 is no greater than $\\epsilon$.\n3.2.1 ON THE ASSUMPTIONS MADE\nThe assumptions made in Theorem 1 \u2014 approximate balancedness and deficiency margin at initialization \u2014 are both necessary, in the sense that violating any one of them may lead to convergence failure. We demonstrate this in Appendix C. In the special case of linear residual networks (uniform dimensions and identity initialization), a sufficient condition for the assumptions to be met is that the target matrix have (Frobenius) distance less than 0.5 from identity. 
This strengthens one of the central results in Bartlett et al. (2018) (see Section 5). For a setting of random near-zero initialization, we present in Subsection 3.3 a scheme that, when the output dimension is 1 (scalar regression), ensures assumptions are satisfied (and therefore gradient descent efficiently converges to global minimum) with constant probability. It is an open problem to fully analyze gradient descent under the common initialization scheme of zero-centered Gaussian perturbations applied to each layer independently. We treat this scenario in Appendix B, providing quantitative results concerning the likelihood of each assumption (approximate balancedness or deficiency margin) being met individually. However the question of how likely it is that both assumptions be met simultaneously, and how that depends on the standard deviation of the Gaussian, is left for future work.\nAn additional point to make is that Theorem 1 poses a structural limitation on the linear neural network. Namely, it requires the dimension of each hidden layer (di, i = 1, . . . , N\u22121) to be greater than or equal to the minimum between those of the input (d0) and output (dN ). Indeed, in order for the initial end-to-end matrixW1:N (0) to have deficiency margin c > 0, it must (by Claim 1) have full rank, and this is only possible if there is no intermediate dimension di smaller than min{d0, dN}. We make no other assumptions on network architecture (depth, input/output/hidden dimensions). 3.2.2 PROOF\nThe cornerstone upon which Theorem 1 rests is the following lemma, showing non-trivial descent whenever \u03c3min(W1:N ) is bounded away from zero:\nLemma 1. Under the conditions of Theorem 1, we have that for every t = 0, 1, 2, . . . :5\n`(t+ 1) \u2264 `(t)\u2212 \u03b7 2 \u00b7 \u03c3min\n( W1:N (t) ) 2(N\u22121) N \u00b7 \u2225\u2225\u2225\u2225dL1dW (W1:N (t)) \u2225\u2225\u2225\u22252 F . 
(9)\n5Note that the term dL 1\ndW (W1:N (t)) below stands for the gradient of L1(\u00b7) \u2014 a convex loss over (directly\nparameterized) linear models (Equation (2)) \u2014 at the point W1:N (t) \u2014 the end-to-end matrix of the network at iteration t. It is therefore (see Equation (5)) non-zero anywhere but at a global minimum.\nProof of Lemma 1 (in idealized setting; for complete proof see Appendix D.2). We prove the lemma here for the idealized setting of perfect initial balancedness (\u03b4 = 0):\nW>j+1(0)Wj+1(0) = Wj(0)W > j (0) , \u2200j \u2208 {1, . . . , N \u2212 1} ,\nand infinitesimally small learning rate (\u03b7 \u2192 0+) \u2014 gradient flow:\nW\u0307j(\u03c4) = \u2212 \u2202LN\n\u2202Wj\n( W1(\u03c4), . . . ,WN (\u03c4) ) , j = 1, . . . , N , \u03c4 \u2208 [0,\u221e) ,\nwhere \u03c4 is a continuous time index, and dot symbol (in W\u0307j(\u03c4)) signifies derivative with respect to time. The complete proof, for the realistic case of approximate balancedness and discrete updates (\u03b4, \u03b7 > 0), is similar but much more involved, and appears in Appendix D.2.\nRecall that `(t) \u2014 the objective value at iteration t of gradient descent \u2014 is equal to L1(W1:N (t)) (see Equation (5)). Accordingly, for the idealized setting in consideration, we would like to show:\nd\nd\u03c4 L1 (W1:N (\u03c4)) \u2264 \u2212\n1 2 \u03c3min\n( W1:N (\u03c4) ) 2(N\u22121) N \u00b7 \u2225\u2225\u2225\u2225dL1dW (W1:N (\u03c4)) \u2225\u2225\u2225\u22252 F . (10)\nWe will see that a stronger version of Equation (10) holds, namely, one without the 1/2 factor (which only appears due to discretization).\nBy (Theorem 1 and Claim 1 in) Arora et al. (2018), the weightsW1(\u03c4), . . . 
,WN (\u03c4) remain balanced throughout the entire optimization, and that implies the end-to-end matrixW1:N (\u03c4) moves according to the following differential equation:\nvec ( W\u03071:N (\u03c4) ) = \u2212PW1:N (\u03c4) \u00b7 vec ( dL1\ndW (W1:N (\u03c4))\n) , (11)\nwhere vec(A), for an arbitrary matrix A, stands for vectorization in column-first order, and PW1:N (\u03c4) is a positive semidefinite matrix whose eigenvalues are all greater than or equal to \u03c3min(W1:N (\u03c4))\n2(N\u22121)/N . Taking the derivative of L1(W1:N (\u03c4)) with respect to time, we obtain the sought-after Equation (10) (with no 1/2 factor):\nd\nd\u03c4 L1 (W1:N (\u03c4)) =\n\u2329 vec ( dL1\ndW\n( W1:N (\u03c4) )) , vec ( W\u03071:N (\u03c4) )\u232a = \u2329 vec ( dL1\ndW\n( W1:N (\u03c4) )) ,\u2212PW1:N (\u03c4) \u00b7 vec ( dL1\ndW (W1:N (\u03c4)) )\u232a \u2264 \u2212\u03c3min ( W1:N (\u03c4) ) 2(N\u22121) N \u00b7 \u2225\u2225\u2225\u2225vec(dL1dW (W1:N (\u03c4)) )\u2225\u2225\u2225\u22252 = \u2212\u03c3min ( W1:N (\u03c4) ) 2(N\u22121) N \u00b7 \u2225\u2225\u2225\u2225dL1dW (W1:N (\u03c4)) \u2225\u2225\u2225\u22252 F .\nThe first transition here (equality) is an application of the chain rule; the second (equality) plugs in Equation (11); the third (inequality) results from the fact that the eigenvalues of the symmetric matrix PW1:N (\u03c4) are no smaller than \u03c3min(W1:N (\u03c4))\n2(N\u22121)/N (recall that \u2016\u00b7\u2016 stands for Euclidean norm); and the last (equality) is trivial \u2014 \u2016A\u2016F = \u2016vec(A)\u2016 for any matrix A.\nWith Lemma 1 established, the proof of Theorem 1 readily follows:\nProof of Theorem 1. 
By the definition of $L^1(\\cdot)$ (Equation (2)), for any $W \\in \\mathbb{R}^{d_N \\times d_0}$:\n$$\\frac{dL^1}{dW}(W) = W - \\Phi \\;\\Longrightarrow\\; \\left\\| \\frac{dL^1}{dW}(W) \\right\\|_F^2 = 2 \\cdot L^1(W) .$$\nPlugging this into Equation (9) while recalling that $\\ell(t) = L^1(W_{1:N}(t))$ (Equation (5)), we have (by Lemma 1) that for every $t = 0, 1, 2, \\ldots$:\n$$L^1\\big( W_{1:N}(t+1) \\big) \\le L^1\\big( W_{1:N}(t) \\big) \\cdot \\Big( 1 - \\eta \\cdot \\sigma_{\\min}\\big( W_{1:N}(t) \\big)^{\\frac{2(N-1)}{N}} \\Big) .$$\nSince the coefficients $1 - \\eta \\cdot \\sigma_{\\min}(W_{1:N}(t))^{\\frac{2(N-1)}{N}}$ are necessarily non-negative (otherwise would contradict non-negativity of $L^1(\\cdot)$), we may unroll the inequalities, obtaining:\n$$L^1\\big( W_{1:N}(t+1) \\big) \\le L^1\\big( W_{1:N}(0) \\big) \\cdot \\prod_{t'=0}^{t} \\Big( 1 - \\eta \\cdot \\sigma_{\\min}\\big( W_{1:N}(t') \\big)^{\\frac{2(N-1)}{N}} \\Big) . \\qquad (12)$$\nNow, this in particular means that for every $t' = 0, 1, 2, \\ldots$:\n$$L^1\\big( W_{1:N}(t') \\big) \\le L^1\\big( W_{1:N}(0) \\big) \\;\\Longrightarrow\\; \\|W_{1:N}(t') - \\Phi\\|_F \\le \\|W_{1:N}(0) - \\Phi\\|_F .$$\nDeficiency margin $c$ of $W_{1:N}(0)$ along with Claim 1 thus imply $\\sigma_{\\min}\\big( W_{1:N}(t') \\big) \\ge c$, which when inserted back into Equation (12) yields, for every $t = 1, 2, 3, \\ldots$:\n$$L^1\\big( W_{1:N}(t) \\big) \\le L^1\\big( W_{1:N}(0) \\big) \\cdot \\Big( 1 - \\eta \\cdot c^{\\frac{2(N-1)}{N}} \\Big)^t . \\qquad (13)$$\n$\\eta \\cdot c^{\\frac{2(N-1)}{N}}$ is obviously non-negative, and it is also no greater than 1 (otherwise would contradict non-negativity of $L^1(\\cdot)$). We may therefore incorporate the inequality $1 - \\eta \\cdot c^{2(N-1)/N} \\le \\exp\\big( -\\eta \\cdot c^{2(N-1)/N} \\big)$ into Equation (13):\n$$L^1\\big( W_{1:N}(t) \\big) \\le L^1\\big( W_{1:N}(0) \\big) \\cdot \\exp\\big( -\\eta \\cdot c^{2(N-1)/N} \\cdot t \\big) ,$$\nfrom which it follows that $L^1(W_{1:N}(t)) \\le \\epsilon$ if:\n$$t \\ge \\frac{1}{\\eta \\cdot c^{2(N-1)/N}} \\cdot \\log\\left( \\frac{L^1(W_{1:N}(0))}{\\epsilon} \\right) .$$\nRecalling again that $\\ell(t) = L^1(W_{1:N}(t))$ (Equation (5)), we conclude the proof. 
3.3 BALANCED INITIALIZATION\nWe define the following procedure, balanced initialization, which assigns weights randomly while ensuring perfect balancedness:\nProcedure 1 (Balanced initialization). Given d0, d1, . . . , dN \u2208 N such that min{d1, . . . , dN\u22121} \u2265 min{d0, dN} and a distribution D over dN \u00d7 d0 matrices, a balanced initialization of Wj \u2208 Rdj\u00d7dj\u22121 , j=1, . . . , N , assigns these weights as follows:\n(i) Sample A \u2208 RdN\u00d7d0 according to D. (ii) Take singular value decomposition A = U\u03a3V >, where U \u2208 RdN\u00d7min{d0,dN}, V \u2208\nRd0\u00d7min{d0,dN} have orthonormal columns, and \u03a3 \u2208 Rmin{d0,dN}\u00d7min{d0,dN} is diagonal and holds the singular values of A.\n(iii) Set WN ' U\u03a31/N ,WN\u22121 ' \u03a31/N , . . . ,W2 ' \u03a31/N ,W1 ' \u03a31/NV >, where the symbol \u201c'\u201d stands for equality up to zero-valued padding.6 7\nThe concept of balanced initialization, together with Theorem 1, leads to a guarantee for linear convergence (applicable to output dimension 1 \u2014 scalar regression) that holds with constant probability over the randomness in initialization:\nTheorem 2. For any constant 0 < p < 1/2, there are constants d\u20320, a > 0 8 such that the following holds. Assume dN = 1, d0 \u2265 d\u20320, and that the weights W1(0), . . . ,WN (0) are subject to balanced initialization (Procedure 1) such that the entries in W1:N (0) are independent zero-centered Gaussian perturbations with standard deviation s \u2264 \u2016\u03a6\u20162/ \u221a ad20. Suppose also that we run gradient\n6These assignments can be accomplished since min{d1, . . . , dN\u22121} \u2265 min{d0, dN}. 7By design W1:N = A and W>j+1Wj+1 = WjW > j , \u2200j \u2208 {1, . . . , N\u22121}\u2014 these properties are actually all we need in Theorem 2, and step (iii) in Procedure 1 can be replaced by any assignment that meets them. 
descent with learning rate \u03b7 \u2264 (s^2 d0)^{4\u22122/N} / ( 10^5 N^3 \u2016\u03a6\u2016_2^{10\u22126/N} ) . Then, with probability at least p over the random initialization, we have that for every \u03f5 > 0 and:\nT \u2265 (4/\u03b7) \u00b7 ( ln(4) \u00b7 ( \u2016\u03a6\u2016_2 / (s^2 d0) )^{2\u22122/N} + \u2016\u03a6\u2016_2^{2/N\u22122} \u00b7 ln( \u2016\u03a6\u2016_2^2 / (8\u03f5) ) ) ,\nthe loss at iteration T of gradient descent, \u2113(T), is no greater than \u03f5.\nProof. See Appendix D.3.\n8 As shown in the proof of the theorem (Appendix D.3), d\u20320, a > 0 can take on any pair of values for which: (i) d\u20320 \u2265 20; and (ii) ( 1 \u2212 2 exp(\u2212d\u20320/16) )( 3 \u2212 4F(2/\u221a(a/2)) ) \u2265 2p, where F(\u00b7) stands for the cumulative distribution function of the standard normal distribution. For example, if p = 0.25, it suffices to take any d\u20320 \u2265 100, a \u2265 100. We note that condition (i) here (d\u20320 \u2265 20) serves solely for simplification of expressions in the theorem.\n4 EXPERIMENTS\nBalanced initialization (Procedure 1) possesses theoretical advantages compared with the customary layer-wise independent scheme: it allowed us to derive a convergence guarantee that holds with constant probability over the randomness of initialization (Theorem 2). In this section we present empirical evidence suggesting that initializing with balancedness may be beneficial in practice as well. For conciseness, some of the details behind our implementation are deferred to Appendix E.\nWe began by experimenting in the setting covered by our analysis: linear neural networks trained via gradient descent minimization of \u21132 loss over whitened data. The dataset chosen for the experiment was UCI Machine Learning Repository\u2019s \u201cGas Sensor Array Drift at Different Concentrations\u201d (Vergara et al., 2012; Rodriguez-Lujan et al., 2014).
Specifically, we used the dataset\u2019s \u201cEthanol\u201d problem, a scalar regression task with 2565 examples, each comprising 128 features (one of the largest numeric regression tasks in the repository). Starting with the customary initialization of layer-wise independent random Gaussian perturbations centered at zero, we trained a three-layer network (N = 3) with hidden widths (d1, d2) set to 32, and measured the time (number of iterations) it takes to converge (reach training loss within \u03f5 = 10^{\u22125} from optimum) under different choices of standard deviation for the initialization. To account for the possibility of different standard deviations requiring different learning rates (values for \u03b7), we applied, for each standard deviation independently, a grid search over learning rates, and recorded the one that led to fastest convergence. The result of this test is presented in Figure 1(a). As can be seen, there is a range of standard deviations that leads to fast convergence (a few hundred iterations or less), below and above which optimization decelerates by orders of magnitude. This accords with our discussion at the end of Subsection 3.3, by which overly small initialization ensures approximate balancedness (small \u03b4; see Definition 1) but diminishes deficiency margin (small c; see Definition 2), the \u201cvanishing gradient problem\u201d, whereas large initialization hinders both approximate balancedness and deficiency margin, the \u201cexploding gradient problem\u201d. In that regard, as a sanity test for the validity of our analysis, in a case where approximate balancedness is met at initialization (small standard deviation), we measured its persistence throughout optimization.
As Figure 1(c) shows, our theoretical findings manifest themselves here: trajectories of gradient descent indeed preserve weight balancedness.\nIn addition to a three-layer network, we also evaluated a deeper, eight-layer model with hidden widths identical to the former (N = 8, d1 = \u00b7 \u00b7 \u00b7 = d7 = 32). In particular, using the same experimental protocol as above, we measured convergence time under different choices of standard deviation for the initialization. Figure 1(a) displays the result of this test alongside that of the three-layer model. As the figure shows, transitioning from three layers to eight aggravated the instability with respect to initialization: there is now a narrow band of standard deviations that lead to convergence in reasonable time, and outside of this band convergence is extremely slow, to the point where it does not take place within the duration we allowed (10^6 iterations). From the perspective of our analysis, a possible explanation for the aggravation is as follows: under layer-wise independent initialization, the magnitude of the end-to-end matrix W1:N depends on the standard deviation in a manner that is exponential in depth, thus for large depths the range of standard deviations that lead to moderately sized W1:N (as required for a deficiency margin) is limited, and within this range, there may not be many standard deviations small enough to ensure approximate balancedness. The procedure of balanced initialization (Procedure 1) circumvents these difficulties: it assigns W1:N directly (no exponential dependence on depth), and distributes its content between the individual weights W1, . . . ,WN in a perfectly balanced fashion.
Rerunning the experiment of Figure 1(a) with this initialization replacing the customary layer-wise scheme (using same experimental protocol), we obtained the results shown in Figure 1(b) \u2014 both the original three layer network, and the deeper eight layer model, converged quickly under virtually all standard deviations tried.\nAs a final experiment, we evaluated the effect of balanced initialization in a setting that involves non-linear activation, softmax-cross-entropy loss and stochastic optimization (factors not accounted for by our analysis). For this purpose, we turned to the MNIST tutorial built into TensorFlow (Abadi et al., 2016),9 which comprises a fully-connected neural network with two hidden layers (width 128 followed by 32) and ReLU activation (Nair and Hinton, 2010), trained through stochastic gradient descent (over softmax-cross-entropy loss) with batch size 100, initialized via customary layer-wise independent Gaussian perturbations centered at zero. While keeping the learning rate at its default value 0.01, we varied the standard deviation of initialization, and for each value measured the training loss after 10 epochs.10 We then replaced the original (layer-wise independent) initialization with a balanced initialization based on Gaussian perturbations centered at zero (latter was implemented per Procedure 1, disregarding non-linear activation), and repeated the process. The results of this experiment are shown in Figure 1(d). Although our theoretical analysis does not cover non-linear activation, softmax-cross-entropy loss or stochasticity in optimization, its conclusion of balanced initialization leading to improved (faster and more stable) convergence carried over to such setting. 
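For readers who want to experiment with Procedure 1, the steps can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the implementation used for the experiments above (whose details are deferred to Appendix E): the names balanced_init and gaussian_end_to_end, the argument layout, the layer sizes and the standard deviation 0.1 are all hypothetical.

```python
import numpy as np

def balanced_init(dims, sample_end_to_end, rng):
    # dims = [d0, d1, ..., dN]; returns ([W1, ..., WN], A) with Wj of shape (dj, dj-1).
    d0, dN, depth = dims[0], dims[-1], len(dims) - 1
    r = min(d0, dN)
    if depth >= 2 and min(dims[1:-1]) < r:
        raise ValueError('hidden widths must be at least min(d0, dN)')
    A = sample_end_to_end(rng, dN, d0)                 # step (i): draw the end-to-end matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step (ii): U is dN x r, Vt is r x d0
    root = np.diag(s ** (1.0 / depth))                 # Sigma^(1/N)
    weights = []
    for j in range(1, depth + 1):                      # step (iii)
        if depth == 1:
            block = A
        elif j == 1:
            block = root @ Vt
        elif j == depth:
            block = U @ root
        else:
            block = root
        W = np.zeros((dims[j], dims[j - 1]))
        W[:block.shape[0], :block.shape[1]] = block    # zero-valued padding
        weights.append(W)
    return weights, A

def gaussian_end_to_end(rng, rows, cols):
    # Hypothetical choice of the distribution D: zero-centered Gaussian entries.
    return 0.1 * rng.standard_normal((rows, cols))

rng = np.random.default_rng(0)
weights, A = balanced_init([128, 32, 32, 1], gaussian_end_to_end, rng)

# Check the two properties listed in footnote 7 of Procedure 1.
product = weights[0]
for W in weights[1:]:
    product = W @ product
gaps = [np.linalg.norm(weights[j + 1].T @ weights[j + 1] - weights[j] @ weights[j].T)
        for j in range(len(weights) - 1)]
print(np.allclose(product, A), max(gaps))
```

The check at the end confirms numerically that the product of the layers reproduces the sampled matrix A and that consecutive layers are balanced up to floating-point error.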
6 CONCLUSION\nFor deep linear neural networks, we have rigorously proven convergence of gradient descent to global minima, at a linear rate, provided that the initial weight matrices are approximately balanced and the initial end-to-end matrix has positive deficiency margin. The result applies to networks with arbitrary depth, and any configuration of input/output/hidden dimensions that supports full rank, i.e. in which no hidden layer has dimension smaller than both the input and output.\nOur assumptions on initialization \u2014 approximate balancedness and deficiency margin \u2014 are both necessary, in the sense that violating any one of them may lead to convergence failure, as we demonstrated explicitly. Moreover, for networks with output dimension 1 (scalar regression), we have shown that a balanced initialization, i.e. a random choice of the end-to-end matrix followed by a balanced partition across all layers, leads assumptions to be met, and thus convergence to take place, with constant probability. Rigorously proving efficient convergence with significant probability under customary layer-wise independent initialization remains an open problem. The recent work of Shamir (2018) suggests that this may not be possible, as at least in some settings, the number of iterations required for convergence is exponential in depth with overwhelming probability. This negative result, a theoretical manifestation of the \u201cvanishing gradient problem\u201d, is circumvented by balanced initialization. Through simple experiments we have shown that the latter can lead to favorable convergence in deep learning practice, as it does in theory. Further investigation of balanced initialization, including development of variants for convolutional layers, is regarded as a promising direction for future research.\nThe analysis in this paper uncovers special properties of the optimization landscape in the vicinity of gradient descent trajectories. 
We expect similar ideas to prove useful in further study of gradient descent on non-convex objectives, including training losses of deep non-linear neural networks.\n11 There is actually an additional third scenario being treated (\u03a6 is asymmetric and positive definite), but since that requires a dedicated optimization algorithm, it is outside our scope.\n12 1/(10e) is the optimal (largest) distance that may be obtained (via careful choice of constants) from the proof of Theorem 1 in Bartlett et al. (2018).\nACKNOWLEDGMENTS\nThis work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Schmidt Foundation.\nA \u21132 LOSS OVER WHITENED DATA\nRecall the \u21132 loss of a linear predictor W \u2208 R^{dy\u00d7dx} as defined in Section 2:\nL(W) = (1/(2m)) \u2016WX \u2212 Y\u2016_F^2 ,\nwhere X \u2208 R^{dx\u00d7m} and Y \u2208 R^{dy\u00d7m}. Define \u039bxx := (1/m) XX^\u22a4 \u2208 R^{dx\u00d7dx}, \u039byy := (1/m) YY^\u22a4 \u2208 R^{dy\u00d7dy} and \u039byx := (1/m) YX^\u22a4 \u2208 R^{dy\u00d7dx}. Using the relation \u2016A\u2016_F^2 = Tr(AA^\u22a4), we have:\nL(W) = (1/(2m)) Tr( (WX \u2212 Y)(WX \u2212 Y)^\u22a4 )\n= (1/(2m)) Tr(WXX^\u22a4W^\u22a4) \u2212 (1/m) Tr(WXY^\u22a4) + (1/(2m)) Tr(YY^\u22a4)\n= (1/2) Tr(W\u039bxxW^\u22a4) \u2212 Tr(W\u039byx^\u22a4) + (1/2) Tr(\u039byy) .\nBy definition, when data is whitened, \u039bxx is equal to identity, yielding:\nL(W) = (1/2) Tr(WW^\u22a4) \u2212 Tr(W\u039byx^\u22a4) + (1/2) Tr(\u039byy)\n= (1/2) Tr( (W \u2212 \u039byx)(W \u2212 \u039byx)^\u22a4 ) \u2212 (1/2) Tr(\u039byx\u039byx^\u22a4) + (1/2) Tr(\u039byy)\n= (1/2) \u2016W \u2212 \u039byx\u2016_F^2 + c ,\nwhere c := \u2212(1/2) Tr(\u039byx\u039byx^\u22a4) + (1/2) Tr(\u039byy) does not depend on W. Hence we arrive at Equation (1).\nB APPROXIMATE BALANCEDNESS AND DEFICIENCY MARGIN UNDER CUSTOMARY INITIALIZATION\nTwo assumptions concerning initialization facilitate our main convergence result (Theorem 1): (i) the initial weights W1(0), . . . 
,WN(0) are approximately balanced (see Definition 1); and (ii) the initial end-to-end matrix W1:N(0) has positive deficiency margin with respect to the target \u03a6 (see Definition 2). The current appendix studies the likelihood of these assumptions being met under customary initialization of random (layer-wise independent) Gaussian perturbations centered at zero.\nFor approximate balancedness we have the following claim, which shows that it becomes more and more likely the smaller the standard deviation of initialization is:\nClaim 2. Assume all entries in the matrices Wj \u2208 R^{dj\u00d7dj\u22121}, j = 1, . . . , N , are drawn independently at random from a Gaussian distribution with mean zero and standard deviation s > 0. Then, for any \u03b4 > 0, the probability of W1, . . . ,WN being \u03b4-balanced is at least max{0, 1 \u2212 10 \u03b4^{\u22122} N s^4 dmax^3}, where dmax := max{d0, . . . , dN}.\nProof. See Appendix D.4.\nIn terms of deficiency margin, the claim below treats the case of a single output model (scalar regression), and shows that if the standard deviation of initialization is sufficiently small, with probability close to 0.5, a deficiency margin will be met. However, for this deficiency margin to meet a chosen threshold c, the standard deviation needs to be sufficiently large.\nClaim 3. There is a constant C1 > 0 such that the following holds. Consider the case where dN = 1, d0 \u2265 20,13 and suppose all entries in the matrices Wj \u2208 R^{dj\u00d7dj\u22121}, j = 1, . . . , N , are drawn independently at random from a Gaussian distribution with mean zero, whose standard deviation s > 0 is small with respect to the target, i.e. s \u2264 \u2016\u03a6\u2016_F^{1/N} / ( 10^5 d0^3 d1 \u00b7 \u00b7 \u00b7 dN\u22121 C1 )^{1/(2N)}. 
Then, for any c with 0 < c \u2264 \u2016\u03a6\u2016_F / ( 10^5 d0^3 C1 (C1 N)^{2N} ), the probability of the end-to-end matrix W1:N having deficiency margin c with respect to \u03a6 is at least 0.49 if: 14 15\ns \u2265 c^{1/(2N)} \u00b7 ( C1 N \u2016\u03a6\u2016_F^{1/(2N)} / (d1 \u00b7 \u00b7 \u00b7 dN\u22121)^{1/(2N)} ) .\nProof. See Appendix D.5.\n13 The requirement d0 \u2265 20 is purely technical, designed to simplify expressions in the claim.\n14 The probability 0.49 can be increased to any p < 1/2 by increasing the constant 10^5 in the upper bounds for s and c.\n15 It is not difficult to see that the latter threshold is never greater than the upper bound for s, thus sought-after standard deviations always exist.\nC CONVERGENCE FAILURES\nIn this appendix we show that the assumptions on initialization facilitating our main convergence result (Theorem 1), namely approximate balancedness and deficiency margin, are both necessary, by demonstrating cases where violating each of them leads to convergence failure. This accords with widely observed empirical phenomena, by which successful optimization in deep learning crucially depends on careful initialization (cf. Sutskever et al. (2013)).\nClaim 4 below shows16 that if one omits from Theorem 1 the assumption of approximate balancedness at initialization, no choice of learning rate can guarantee convergence:\nClaim 4. Assume gradient descent with some learning rate \u03b7 > 0 is applied to a network whose depth N is even, and whose input, output and hidden dimensions d0, . . . , dN are all equal to some d \u2208 N. Then, there exist target matrices \u03a6 such that the following holds. For any c with 0 < c < \u03c3min(\u03a6), there are initializations for which the end-to-end matrix W1:N(0) has deficiency margin c with respect to \u03a6, and yet convergence will fail: the objective will never go beneath a positive constant.\nProof. 
See Appendix D.6.\nIn terms of deficiency margin, we provide (by adapting Theorem 4 in Bartlett et al. (2018)) a different, somewhat stronger result: there exist settings where initialization violates the assumption of deficiency margin, and despite being perfectly balanced, leads to convergence failure, for any choice of learning rate:17\nClaim 5. Consider a network whose depth N is even, and whose input, output and hidden dimensions d0, . . . , dN are all equal to some d \u2208 N. Then, there exist target matrices \u03a6 for which there are non-stationary initializations W1(0), . . . ,WN(0) that are 0-balanced, and yet lead gradient descent, under any learning rate, to fail: the objective will never go beneath a positive constant.\nProof. See Appendix D.7.\nD DEFERRED PROOFS\nWe introduce some notation in addition to that specified in Section 2. We use \u2016A\u2016_\u03c3 to denote the spectral norm (largest singular value) of a matrix A, and sometimes \u2016v\u2016_2 as an alternative to \u2016v\u2016, the Euclidean norm of a vector v. Recall that for a matrix A, vec(A) is its vectorization in column-first order. We let F(\u00b7) denote the cumulative distribution function of the standard normal distribution, i.e. F(x) = \u222b_{\u2212\u221e}^{x} (1/\u221a(2\u03c0)) e^{\u2212u^2/2} du (x \u2208 R).\nTo simplify the presentation we will oftentimes use W as an alternative (shortened) notation for W1:N , the end-to-end matrix of a linear neural network. We will also use L(\u00b7) as shorthand for L^1(\u00b7), the loss associated with a (directly parameterized) linear model, i.e. L(W) := (1/2) \u2016W \u2212 \u03a6\u2016_F^2 . Therefore, in the context of gradient descent training a linear neural network, the following expressions all represent the loss at iteration t:\n\u2113(t) = L^N(W1(t), . . . 
,WN(t)) = L^1(W1:N(t)) = L^1(W(t)) = L(W(t)) = (1/2) \u2016W(t) \u2212 \u03a6\u2016_F^2 .\nAlso, for weights Wj \u2208 R^{dj\u00d7dj\u22121}, j = 1, . . . , N of a linear neural network, we generalize the notation W1:N , and define W_{j:j\u2032} := W_{j\u2032} W_{j\u2032\u22121} \u00b7 \u00b7 \u00b7 W_j for every 1 \u2264 j \u2264 j\u2032 \u2264 N . Note that W_{j:j\u2032}^\u22a4 = W_j^\u22a4 W_{j+1}^\u22a4 \u00b7 \u00b7 \u00b7 W_{j\u2032}^\u22a4 . Then, by a simple gradient calculation, the gradient descent updates (4) can be written as\nWj(t+1) = Wj(t) \u2212 \u03b7 \u00b7 W_{j+1:N}^\u22a4(t) \u00b7 dL/dW (W(t)) \u00b7 W_{1:j\u22121}^\u22a4(t) , 1 \u2264 j \u2264 N , (14)\nwhere we define W_{1:0}(t) := I_{d0} and W_{N+1:N}(t) := I_{dN} for completeness.\n16 For simplicity of presentation, the claim treats the case of even depth and uniform dimension across all layers. It can easily be extended to account for arbitrary depth and input/output/hidden dimensions.\n17 This statement becomes trivial if one allows initialization at a suboptimal stationary point, e.g. Wj(0) = 0, j = 1, . . . , N . Claim 5 rules out such trivialities by considering only non-stationary initializations.\nFinally, recall the standard definition of the tensor product of two matrices (also known as the Kronecker product): for matrices A \u2208 R^{mA\u00d7nA}, B \u2208 R^{mB\u00d7nB}, their tensor product A \u2297 B \u2208 R^{mA mB\u00d7nA nB} is defined as\nA \u2297 B =\n[ a_{1,1}B \u00b7 \u00b7 \u00b7 a_{1,nA}B ]\n[ \u22ee \u22f1 \u22ee ]\n[ a_{mA,1}B \u00b7 \u00b7 \u00b7 a_{mA,nA}B ] ,\nwhere a_{i,j} is the element in the i-th row and j-th column of A.\nD.1 PROOF OF CLAIM 1\nProof. Recall that for any matrices A and B of compatible sizes \u03c3min(A + B) \u2265 \u03c3min(A) \u2212 \u03c3max(B), and that the Frobenius norm of a matrix is always lower bounded by its largest singular value (Horn and Johnson (1990)). 
Using these facts, we have:\n\u03c3min(W\u2032) = \u03c3min( \u03a6 + (W\u2032 \u2212 \u03a6) ) \u2265 \u03c3min(\u03a6) \u2212 \u03c3max(W\u2032 \u2212 \u03a6)\n\u2265 \u03c3min(\u03a6) \u2212 \u2016W\u2032 \u2212 \u03a6\u2016_F \u2265 \u03c3min(\u03a6) \u2212 \u2016W \u2212 \u03a6\u2016_F \u2265 \u03c3min(\u03a6) \u2212 (\u03c3min(\u03a6) \u2212 c) = c .\nD.2 PROOF OF LEMMA 1\nTo prove Lemma 1, we will in fact prove a stronger result, Lemma 2 below, which states that for each iteration t, in addition to (9) being satisfied, certain other properties are also satisfied, namely: (i) the weight matrices W1(t), . . . ,WN(t) are 2\u03b4-balanced, and (ii) W1(t), . . . ,WN(t) have bounded spectral norms.\nLemma 2. Suppose the conditions of Theorem 1 are satisfied. Then for all t \u2208 N \u222a {0},\n(A(t)) For 1 \u2264 j \u2264 N \u2212 1, \u2016W_{j+1}^\u22a4(t)W_{j+1}(t) \u2212 Wj(t)Wj^\u22a4(t)\u2016_F \u2264 2\u03b4.\n(A\u2032(t)) If t \u2265 1, then for 1 \u2264 j \u2264 N \u2212 1,\n\u2016W_{j+1}^\u22a4(t)W_{j+1}(t) \u2212 Wj(t)Wj^\u22a4(t)\u2016_F \u2264 \u2016W_{j+1}^\u22a4(t\u22121)W_{j+1}(t\u22121) \u2212 Wj(t\u22121)Wj^\u22a4(t\u22121)\u2016_F\n+ \u03b7^2 \u00b7 \u2016dL^1/dW (W(t\u22121))\u2016_F \u00b7 \u2016dL^1/dW (W(t\u22121))\u2016_\u03c3 \u00b7 4 \u00b7 (2\u2016\u03a6\u2016_F)^{2(N\u22121)/N} .\n(B(t)) If t = 0, then \u2113(t) \u2264 (1/2) \u2016\u03a6\u2016_F^2 . If t \u2265 1, then\n\u2113(t) \u2264 \u2113(t\u22121) \u2212 (\u03b7/2) \u00b7 \u03c3min(W(t\u22121))^{2(N\u22121)/N} \u00b7 \u2016dL^1/dW (W(t\u22121))\u2016_F^2 .\n(C(t)) For 1 \u2264 j \u2264 N , \u2016Wj(t)\u2016_\u03c3 \u2264 (4\u2016\u03a6\u2016_F)^{1/N} .\nFirst we observe that Lemma 1 is an immediate consequence of Lemma 2.\nProof of Lemma 1. Notice that condition B(t) of Lemma 2 for each t \u2265 1 immediately establishes the conclusion of Lemma 1 at time step t \u2212 1.\nD.2.1 PRELIMINARY LEMMAS\nWe next prove some preliminary lemmas which will aid us in the proof of Lemma 2. 
The first is a matrix inequality that follows from Lidskii\u2019s theorem. For a matrix A, let Sing(A) denote the rectangular diagonal matrix of the same size, whose diagonal elements are the singular values of A arranged in non-increasing order (starting from the (1, 1) position).\nLemma 3 (Bhatia (1997), Exercise IV.3.5). For any two matrices A, B of the same size, \u2016Sing(A) \u2212 Sing(B)\u2016_\u03c3 \u2264 \u2016A \u2212 B\u2016_\u03c3 and \u2016Sing(A) \u2212 Sing(B)\u2016_F \u2264 \u2016A \u2212 B\u2016_F .\nUsing Lemma 3, we get:\nLemma 4. Suppose D1, D2 \u2208 R^{d\u00d7d} are non-negative diagonal matrices with non-increasing values along the diagonal and O \u2208 R^{d\u00d7d} is an orthogonal matrix. Suppose that \u2016D1 \u2212 O D2 O^\u22a4\u2016_F \u2264 \u03f5, for some \u03f5 > 0. Then:\n1. \u2016D1 \u2212 O D1 O^\u22a4\u2016_F \u2264 2\u03f5.\n2. \u2016D1 \u2212 D2\u2016_F \u2264 \u03f5.\nProof. Since D1 and O D2 O^\u22a4 are both symmetric positive semi-definite matrices, their singular values are equal to their eigenvalues. Moreover, the singular values of D1 are simply its diagonal elements and the singular values of O D2 O^\u22a4 are simply the diagonal elements of D2. Thus by Lemma 3 we get that \u2016D1 \u2212 D2\u2016_F \u2264 \u2016D1 \u2212 O D2 O^\u22a4\u2016_F \u2264 \u03f5. Since the Frobenius norm is unitarily invariant, \u2016D1 \u2212 D2\u2016_F = \u2016O D1 O^\u22a4 \u2212 O D2 O^\u22a4\u2016_F , and by the triangle inequality it follows that\n\u2016D1 \u2212 O D1 O^\u22a4\u2016_F \u2264 \u2016O D1 O^\u22a4 \u2212 O D2 O^\u22a4\u2016_F + \u2016D1 \u2212 O D2 O^\u22a4\u2016_F \u2264 2\u03f5.\nLemma 5 below states that if W1, . . . ,WN are approximately balanced matrices, i.e. W_{j+1}^\u22a4 W_{j+1} \u2212 Wj Wj^\u22a4 has small Frobenius norm for 1 \u2264 j \u2264 N \u2212 1, then we can bound the Frobenius distance between W_{1:j}^\u22a4 W_{1:j} and (W1^\u22a4 W1)^j (as well as between W_{j:N} W_{j:N}^\u22a4 and (WN WN^\u22a4)^{N\u2212j+1}).\nLemma 5. 
Suppose that dN \u2264 dN\u22121, d0 \u2264 d1, and that for some \u03bd > 0, M > 0, the matrices Wj \u2208 R^{dj\u00d7dj\u22121}, 1 \u2264 j \u2264 N satisfy, for 1 \u2264 j \u2264 N \u2212 1,\n\u2016W_{j+1}^\u22a4 W_{j+1} \u2212 Wj Wj^\u22a4\u2016_F \u2264 \u03bd, (15)\nand for 1 \u2264 j \u2264 N , \u2016Wj\u2016_\u03c3 \u2264 M . Then, for 1 \u2264 j \u2264 N ,\n\u2016W_{1:j}^\u22a4 W_{1:j} \u2212 (W1^\u22a4 W1)^j\u2016_F \u2264 (3/2) \u03bd \u00b7 M^{2(j\u22121)} j^2 , (16)\nand\n\u2016W_{j:N} W_{j:N}^\u22a4 \u2212 (WN WN^\u22a4)^{N\u2212j+1}\u2016_F \u2264 (3/2) \u03bd \u00b7 M^{2(N\u2212j)} (N \u2212 j + 1)^2 . (17)\nMoreover, if \u03c3min denotes the minimum singular value of W1:N , \u03c3_{1,min} denotes the minimum singular value of W1 and \u03c3_{N,min} denotes the minimum singular value of WN , then\n\u03c3min^2 \u2212 (3/2) \u03bd M^{2(N\u22121)} N^2 \u2264 { \u03c3_{N,min}^{2N} if dN \u2265 d0 ; \u03c3_{1,min}^{2N} if dN \u2264 d0 } . (18)\nProof. For 1 \u2264 j \u2264 N , let us write the singular value decomposition of Wj as Wj = Uj \u03a3j Vj^\u22a4, where Uj \u2208 R^{dj\u00d7dj} and Vj \u2208 R^{dj\u22121\u00d7dj\u22121} are orthogonal matrices and \u03a3j \u2208 R^{dj\u00d7dj\u22121} is diagonal. We may assume without loss of generality that the singular values of Wj are non-increasing along the diagonal of \u03a3j . Then we can write (15) as\n\u2016V_{j+1} \u03a3_{j+1}^\u22a4 \u03a3_{j+1} V_{j+1}^\u22a4 \u2212 Uj \u03a3j \u03a3j^\u22a4 Uj^\u22a4\u2016_F \u2264 \u03bd.\nSince the Frobenius norm is invariant to orthogonal transformations, we get that\n\u2016\u03a3_{j+1}^\u22a4 \u03a3_{j+1} \u2212 V_{j+1}^\u22a4 Uj \u03a3j \u03a3j^\u22a4 Uj^\u22a4 V_{j+1}\u2016_F \u2264 \u03bd.\nBy Lemma 4, we have that \u2016\u03a3_{j+1}^\u22a4 \u03a3_{j+1} \u2212 \u03a3j \u03a3j^\u22a4\u2016_F \u2264 \u03bd and \u2016\u03a3j \u03a3j^\u22a4 \u2212 V_{j+1}^\u22a4 Uj \u03a3j \u03a3j^\u22a4 Uj^\u22a4 V_{j+1}\u2016_F \u2264 2\u03bd. 
We may rewrite the latter of these two inequalities as\n\u2016[\u03a3j\u03a3>j , V >j+1Uj ]\u2016F = \u2016[\u03a3j\u03a3>j , V >j+1Uj ]U>j Vj+1\u2016F = \u2016\u03a3j\u03a3>j \u2212V >j+1Uj\u03a3j\u03a3>j U>j Vj+1\u2016F \u2264 2\u03bd.\nNote that Wj:NW > j:N = Wj+1:NUj\u03a3j\u03a3 > j U > j W > j+1:N .\nFor matrices A,B, we have that \u2016AB\u2016F \u2264 \u2016A\u2016\u03c3 \u00b7 \u2016B\u2016F . Therefore, for j + 1 \u2264 i \u2264 N , we have that\n\u2016Wi:NUi\u22121(\u03a3i\u22121\u03a3>i\u22121)i\u2212jU>i\u22121W>i:N \u2212Wi+1:NUi(\u03a3i\u03a3>i )i\u2212j+1U>i W>i+1:N\u2016F = \u2016Wi+1:NUi ( \u03a3iV > i Ui\u22121(\u03a3i\u22121\u03a3 > i\u22121) i\u2212jU>i\u22121Vi\u03a3 > i \u2212 (\u03a3i\u03a3>i )i\u2212j+1 ) U>i W > i+1:N\u2016F \u2264 \u2016Wi+1:NUi\u03a3i\u20162\u03c3 \u00b7 \u2016(\u03a3i\u22121\u03a3>i\u22121)i\u2212j + [V >i Ui\u22121, (\u03a3i\u22121\u03a3>i\u22121)i\u2212j ]U>i\u22121Vi \u2212 (\u03a3>i \u03a3i)i\u2212j\u2016F \u2264 \u2016Wi:N\u20162\u03c3 ( \u2016[V >i Ui\u22121, (\u03a3i\u22121\u03a3>i\u22121)i\u2212j ]\u2016F + \u2016(\u03a3i\u22121\u03a3>i\u22121)i\u2212j \u2212 (\u03a3>i \u03a3i)i\u2212j\u2016F ) .\nNext, we have that \u2016[V >i Ui\u22121, (\u03a3i\u22121\u03a3>i\u22121)i\u2212j ]\u2016F \u2264 i\u2212j\u22121\u2211 k=0 \u2016(\u03a3i\u22121\u03a3>i\u22121)k[V >i Ui\u22121,\u03a3i\u22121\u03a3>i\u22121](\u03a3i\u22121\u03a3>i\u22121)i\u2212j\u22121\u2212k\u2016F\n\u2264 i\u2212j\u22121\u2211 k=0 \u2016(\u03a3i\u22121\u03a3>i\u22121)i\u2212j\u22121\u2016\u03c3 \u00b7 \u2016[V >i Ui\u22121,\u03a3i\u22121\u03a3>i\u22121]\u2016F \u2264 (i\u2212 j)\u2016Wi\u22121\u20162(i\u2212j\u22121)\u03c3 \u00b7 2\u03bd.\nWe now argue that \u2016(\u03a3i\u22121\u03a3>i\u22121)k \u2212 (\u03a3>i \u03a3i)k\u2016F \u2264 \u03bd \u00b7 kM2(k\u22121). Note that \u2016\u03a3i\u22121\u03a3>i\u22121 \u2212 \u03a3>i \u03a3i\u2016F \u2264 \u03bd, verifying the case k = 1. 
To see the general case, since square diagonal matrices commute, we have that\n\u2016(\u03a3i\u22121\u03a3>i\u22121)k \u2212 (\u03a3>i \u03a3i)k\u2016F = \u2225\u2225\u2225\u2225\u2225(\u03a3i\u22121\u03a3>i\u22121 \u2212 \u03a3>i \u03a3i) \u00b7 ( k\u22121\u2211 `=0 (\u03a3i\u22121\u03a3 > i\u22121) `(\u03a3>i \u03a3i) k\u22121\u2212` )\u2225\u2225\u2225\u2225\u2225 F\n\u2264 \u03bd \u00b7 k\u22121\u2211 `=0 \u2016Wi\u22121\u20162`\u03c3 \u00b7 \u2016Wi\u20162(k\u2212`\u22121)\u03c3 \u2264 \u03bdkM2(k\u22121).\nIt then follows that\n\u2016Wi:NUi\u22121(\u03a3i\u22121\u03a3>i\u22121)i\u2212jU>i\u22121W>i:N \u2212Wi+1:NUi(\u03a3i\u03a3>i )i\u2212j+1U>i W>i+1:N\u2016F \u2264 \u2016Wi:N\u20162\u03c3 \u00b7 ( (i\u2212 j)M2(i\u2212j\u22121) \u00b7 2\u03bd + \u03bd(i\u2212 j)M2(i\u2212j\u22121) )\n= \u2016Wi:N\u20162\u03c3 \u00b7 3\u03bd(i\u2212 j)M2(i\u2212j\u22121).\nBy the triangle inequality, we then have that\n\u2016Wj:NW>j:N \u2212 UN (\u03a3N\u03a3>N )N\u2212j+1U>N \u2016F\n\u2264 \u03bd N\u2211\ni=j+1\n\u2016Wi:N\u20162\u03c3 \u00b7 3(i\u2212 j)M2(i\u2212j\u22121)\n\u2264 3\u03bd N\u2211\ni=j+1\n(i\u2212 j)M2(N\u2212i+1)M2(i\u2212j\u22121)\n= 3\u03bdM2(N\u2212j) N\u2211\ni=j+1\n(i\u2212 j) \u2264 3 2 \u03bd \u00b7M2(N\u2212j) \u00b7 (N \u2212 j + 1)2. (19)\nBy an identical argument (formally, by replacing Wj with W>N\u2212j+1), we get that\n||W>1:jW1:j \u2212 V1(\u03a3>1 \u03a31)jV >1 \u2016F \u2264 3\n2 \u03bd \u00b7M2(j\u22121) \u00b7 j2. (20)\n(19) and (20) verify (17) and (16), respectively, so it only remains to verify (18).\nLetting j = 1 in (19), we get\n\u2016W1:NW>1:N \u2212 UN (\u03a3N\u03a3>N )NU>N \u2016F \u2264 3\n2 \u03bd \u00b7M2(N\u22121) \u00b7N2. (21)\nLet us write the eigendecomposition of W1:NW>1:N with an orthogonal eigenbasis as W1:NW > 1:N = U\u03a3U>, where \u03a3 is diagonal with its (non-negative) elements arranged in non-increasing order and U is orthogonal. 
We can write the left hand side of (21) as \u2016U\u03a3U> \u2212 UN (\u03a3N\u03a3>N )NU>N \u2016F = \u2016\u03a3\u2212 U>UN (\u03a3N\u03a3>N )NU>NU\u2016F . By Lemma 4, we have that\n\u2016\u03a3\u2212 (\u03a3N\u03a3>N )N\u2016F \u2264 3\n2 \u03bdM2(N\u22121)N2. (22)\nRecall thatW \u2208 RdN\u00d7d0 . Suppose first that dN \u2264 d0. Let \u03c3min denote the minimum singular value of W1:N (so that \u03c32min is the element in the (dN , dN ) position of \u03a3 \u2208 RdN\u00d7dN ), and \u03c3N,min denote the minimum singular value (i.e. diagonal element) of \u03a3N , which lies in the (dN , dN ) position of \u03a3N . (Note that the (dN , dN ) position of \u03a3N \u2208 RdN\u00d7dN\u22121 exists since dN\u22121 \u2265 dN by assumption.) Then\n(\u03c32NN,min \u2212 \u03c32min)2 \u2264 ( 3\n2 \u03bdM2(N\u22121)N2\n)2 ,\nso\n\u03c32NN,min \u2265 \u03c32min \u2212 3\n2 \u03bdM2(N\u22121)N2.\nBy an identical argument using (20), we get that, in the case that d0 \u2264 dN , if \u03c31,min denotes the minimum singular value of \u03a31, then\n\u03c32N1,min \u2265 \u03c32min \u2212 3\n2 \u03bdM2(N\u22121)N2.\n(Notice that we have used the fact that the nonzero eigenvalues of W1:NW>1:N are the same as the nonzero eigenvalues of W>1:NW1:N .) This completes the proof of (18).\nUsing Lemma 5, we next show in Lemma 6 that if W1, . . . ,WN are approximately balanced, then an upper bound on \u2016WN \u00b7 \u00b7 \u00b7W1\u2016\u03c3 implies an upper bound on \u2016Wj\u2016\u03c3 for 1 \u2264 j \u2264 N .\nLemma 6. Suppose \u03bd, C are real numbers satisfying C > 0 and 0 < \u03bd \u2264 C 2/N\n30N2 . Moreover suppose that the matrices W1, . . . ,WN satisfy the following:\n1. For 1 \u2264 j \u2264 N \u2212 1, \u2016W>j+1Wj+1 \u2212WjW>j \u2016F \u2264 \u03bd.\n2. \u2016WN \u00b7 \u00b7 \u00b7W1\u2016\u03c3 \u2264 C.\nThen for 1 \u2264 j \u2264 N , \u2016Wj\u2016\u03c3 \u2264 C1/N \u00b7 21/(2N).\nProof. 
For 1 \u2264 j \u2264 N , let us write the singular value decomposition of Wj as Wj = Uj\u03a3jV >j , where the singular values of Wj are decreasing along the main diagonal of \u03a3j . By Lemma 4, we have that for 1 \u2264 j \u2264 N \u2212 1, \u2016\u03a3>j+1\u03a3j+1 \u2212 \u03a3j\u03a3>j \u2016F \u2264 \u03bd, which implies that\u2223\u2223\u2016\u03a3>j+1\u03a3j+1\u2016\u03c3 \u2212 \u2016\u03a3j\u03a3>j \u2016\u03c3\u2223\u2223 \u2264 \u03bd. Write M = max1\u2264j\u2264N \u2016Wj\u2016\u03c3 = max1\u2264j\u2264N \u2016\u03a3j\u2016\u03c3 . By the above we have that \u2016\u03a3j\u03a3>j \u2016\u03c3 \u2265 M2 \u2212N\u03bd for 1 \u2264 j \u2264 N .\nLet the singular value decomposition of W1:N be denoted by W1:N = U\u03a3V >, so that \u2016\u03a3\u2016\u03c3 \u2264 C. Then by (17) of Lemma 5 and Lemma 4 (see also (22), where the same argument was used), we have that\n\u2016\u03a3\u03a3> \u2212 (\u03a3N\u03a3>N )N\u2016F \u2264 3\n2 \u03bdM2(N\u22121)N2.\nThen\n\u2016(\u03a3N\u03a3>N )N\u2016\u03c3 \u2264 \u2016\u03a3\u03a3>\u2016\u03c3 + 3\n2 \u03bdM (2(N\u22121))N2 \u2264 \u2016\u03a3\u03a3>\u2016\u03c3 +\n3 2 \u03bd ( \u2016\u03a3N\u03a3>N\u2016\u03c3 + \u03bdN )N\u22121 N2.\n(23)\nNow recall that \u03bd is chosen so that \u03bd \u2264 C 2/N\n30\u00b7N2 . Suppose for the purpose of contradiction that there is some j such that \u2016WjW>j \u2016\u03c3 > 21/NC2/N . Then it must be the case that\n\u2016\u03a3N\u03a3>N\u2016\u03c3 > 21/NC2/N \u2212 \u03bd \u00b7N \u2265 (5/4)1/NC2/N > \u03bd \u00b7 30N2, (24)\nwhere we have used that 21/N \u2212 (5/4)1/N \u2265 1\n30N for all N \u2265 2, which follows by considering the Laurent series exp(1/z) = \u2211\u221e i=1 1 i!zi , which converges in |z| > 0 for z \u2208 C. We now rewrite inequality (24) as\n\u03bd \u2264 \u2016\u03a3N\u03a3 > N\u2016\u03c3\n30N2 . 
(25)\nNext, using (25) and (1 + 1/x)x \u2264 e for all x > 0,\n3 2 \u03bd ( \u2016\u03a3N\u03a3>N\u2016\u03c3 + \u03bdN )N\u22121 N2 \u2264 e 1/30 20 \u00b7 \u2016\u03a3N\u03a3>N\u2016N\u03c3 < e 20 \u00b7 \u2016\u03a3N\u03a3>N\u2016N\u03c3 . (26)\nSince \u2016(\u03a3N\u03a3>N )N\u2016\u03c3 = \u2016\u03a3N\u03a3>N\u2016N\u03c3 , we get by combining (23) and (26) that\n\u2016\u03a3N\u03a3>N\u2016\u03c3 < (1\u2212 e/20)\u22121/N \u00b7 \u2016\u03a3\u03a3>\u20161/N\u03c3 \u2264 (1\u2212 e/20)\u22121/N \u00b7 C2/N ,\nand since 1\u2212e/20 > 1/(5/4), it follows that \u2016\u03a3N\u03a3>N\u2016\u03c3 < (5/4)1/NC2/N , which contradicts (24). It follows that for all 1 \u2264 j \u2264 N , \u2016WjW>j \u2016\u03c3 \u2264 21/NC2/N . The conclusion of the lemma then follows from the fact that \u2016WjW>j \u2016\u03c3 = \u2016Wj\u20162\u03c3 . D.2.2 SINGLE-STEP DESCENT\nLemma 7 below states that if certain conditions on W1(t), . . . ,WN (t) are met, the sought-after descent \u2014 Equation (9) \u2014 will take place at iteration t. We will later show (by induction) that the required conditions indeed hold for every t, thus the descent persists throughout optimization. The proof of Lemma 7 is essentially a discrete, single-step analogue of the continuous proof for Lemma 1 (covering the case of gradient flow) given in Section 3. Lemma 7. Assume the conditions of Theorem 1. Moreover, suppose that for some t, the matrices W1(t), . . . ,WN (t) and the end-to-end matrix W (t) := W1:N (t) satisfy the following properties:\n1. \u2016Wj(t)\u2016\u03c3 \u2264 (4\u2016\u03a6\u2016F )1/N for 1 \u2264 j \u2264 N .\n2. \u2016W (t)\u2212 \u03a6\u2016\u03c3 \u2264 \u2016\u03a6\u2016F .\n3. \u2016W>j+1(t)Wj+1(t)\u2212Wj(t)W>j (t)\u2016F \u2264 2\u03b4 for 1 \u2264 j \u2264 N \u2212 1.\n4. 
\u03c3min := \u03c3min(W (t)) \u2265 c.\nThen, after applying a gradient descent update (4) we have that\nL(W (t+ 1))\u2212 L(W (t)) \u2264 \u2212\u03b7 2 \u03c3 2(N\u22121)/N min \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F .\nProof. For simplicity write M = (4\u2016\u03a6\u2016F )1/N and B = \u2016\u03a6\u2016F . We first claim that\n\u03b7 \u2264 min\n{ 1\n2MN\u22122BN ,\n\u03c3 2(N\u22121)/N min 24 \u00b7 2M3N\u22124N2B , \u03c3 2(N\u22121)/N min 24N2M4(N\u22121) ,\n\u03c3 2(N\u22121)/(3N) min\n(24 \u00b7 4M6N\u22128N4B2)1/3\n} . (27)\nSince c \u2264 \u03c3min, for (27) to hold it suffices to have\n\u03b7 \u2264 min  18\u2016\u03a6\u2016(2N\u22122)/NF N , c2(N\u22121)/N 3 \u00b7 211\u2016\u03a6\u20164(N\u22121)/NF N2 ,\nc2(N\u22121)/(3N) 3 \u00b7 26 ( \u2016\u03a6\u2016(8N\u22128)/NF )1/3 N4/3\n .\nAs the minimum singular value of \u03a6 must be at least c, we must have c \u2264 \u2016\u03a6\u2016\u03c3 . Since then c \u2016\u03a6\u2016F \u2264 c \u2016\u03a6\u2016\u03c3 \u2264 1, it holds that\nc2(N\u22121)/N\n\u2016\u03a6\u20164(N\u22121)/NF \u2264 min\n{ 1\n\u2016\u03a6\u20162(N\u22121)/NF , c2(N\u22121)/(3N) \u2016\u03a6\u2016(8N\u22128)/(3N)F\n} ,\nmeaning that it suffices to have\n\u03b7 \u2264 c 2(N\u22121)/N\n3 \u00b7 211N2\u2016\u03a6\u20164(N\u22121)/NF ,\nwhich is guaranteed by (7).\nNext, we claim that\n2\u03b4 \u2264 min\n{ c2(N\u22121)/N\n8 \u00b7 24N3\u2016\u03a6\u20162(N\u22122)/NF ,\nc2\n6 \u00b7 24N2\u2016\u03a6\u20162(N\u22121)/NF\n} (28)\n\u2264 min\n{ \u03c3\n2(N\u22121)/N min\n8N3M2(N\u22122) , \u03c32min 6N2M2(N\u22121)\n} .\nThe second inequality above is trivial, and for the first to hold, since c \u2264 \u2016\u03a6\u2016F , it suffices to take\n2\u03b4 \u2264 c 2\n128 \u00b7N3 \u00b7 \u2016\u03a6\u20162(N\u22121)/NF ,\nwhich is guaranteed by the definition of \u03b4 in Theorem 1.\nNext we continue with the rest of the proof. 
It follows from (14) that (see footnote 18)\nW(t+1) \u2212 W(t)\n= \u220f^{j=N}_{1} ( W_j(t) \u2212 \u03b7 W_{j+1:N}^\u22a4(t) (dL/dW)(W(t)) W_{1:j\u22121}^\u22a4(t) ) \u2212 W_{1:N}(t)\n= \u2212\u03b7 \u2211_{j=1}^{N} W_{j+1:N}(t) W_{j+1:N}^\u22a4(t) (dL/dW)(W(t)) W_{1:j\u22121}^\u22a4(t) W_{1:j\u22121}(t) + (?), (29)\nwhere (?) denotes higher order terms in \u03b7. We now bound the Frobenius norm of (?). To do this, note that since L(W) = (1/2)\u2016W \u2212 \u03a6\u2016_F^2, we have (dL/dW)(W(t)) = W(t) \u2212 \u03a6. Then\n\u2016(?)\u2016_F \u2264 \u2211_{k=2}^{N} \u03b7^k \u00b7 M^{k(N\u22121)+N\u2212k} \u00b7 \u2016(dL/dW)(W(t))\u2016_F \u00b7 \u2016(dL/dW)(W(t))\u2016_\u03c3^{k\u22121} \u00b7 (N choose k)\n\u2264 \u03b7 M^{2N\u22122} N \u2016(dL/dW)(W(t))\u2016_F \u2211_{k=2}^{N} (\u03b7 M^{N\u22122} B N)^{k\u22121}\n\u2264 \u03b7 \u00b7 (2\u03b7 M^{3N\u22124} N^2 B) \u00b7 \u2016(dL/dW)(W(t))\u2016_F, (30)\nwhere the last inequality uses \u03b7 M^{N\u22122} B N \u2264 1/2, which is a consequence of (27).\nFootnote 18: Here, for matrices A_1, . . . , A_K such that A_K A_{K\u22121} \u00b7 \u00b7 \u00b7 A_1 is defined, we write \u220f^{j=K}_{1} A_j := A_K A_{K\u22121} \u00b7 \u00b7 \u00b7 A_1. 
Next, by Lemma 5 with \u03bd = 2\u03b4,\u2225\u2225\u2225\u2225\u2225 N\u2211 j=1 Wj+1:NW > j+1:N (t) dL dW (W (t))W>1:j\u22121(t)W1:j\u22121(t)\n\u2212 N\u2211\nj=1\n(WNW > N )\nN\u2212j dL\ndW (W (t))(W>1 W1) j\u22121 \u2225\u2225\u2225\u2225\u2225 F\n\u2264 \u2225\u2225\u2225\u2225\u2225 N\u2211\nj=1\n(Wj+1:NW > j+1:N (t)\u2212 (WNW>N )N\u2212j)\ndL dW (W (t))W>1:j\u22121(t)W1:j\u22121(t) \u2225\u2225\u2225\u2225\u2225 F\n+ \u2225\u2225\u2225\u2225\u2225 N\u2211\nj=1\n(WNW > N )\nN\u2212j dL\ndW (W (t))(W>1:j\u22121W1:j\u22121 \u2212 (W>1 W1)j\u22121) \u2225\u2225\u2225\u2225\u2225 F\n\u2264 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F \u00b7 ( N\u22121\u2211 j=1 3 2 2\u03b4 \u00b7M2(N\u2212j)(N \u2212 j)2M2(j\u22121) + N\u2211 j=2 3 2 2\u03b4 \u00b7M2(j\u22122)(j \u2212 1)2M2(N\u2212j) )\n\u2264 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F \u00b7 2\u03b4N3M2(N\u22122).\nNext, by standard properties of tensor product, we have that\nvec  N\u2211 j=1 (WNW > N ) N\u2212j dL dW (W (t))(W>1 W1) j\u22121  =\nN\u2211 j=1 ( (W>1 W1) j\u22121 \u2297 (WNW>N )N\u2212j ) vec ( dL dW (W (t)) ) .\nLet us write eigenvalue decompositions W>1 W1 = UDU >,WNW > N = V EV >. Then N\u2211 j=1 ( (W>1 W1) j\u22121 \u2297 (WNW>N )N\u2212j )\n= N\u2211 j=1 ( UDj\u22121U> \u2297 V EN\u2212jV > )\n= (U \u2297 V )  N\u2211 j=1 Dj\u22121 \u2297 EN\u2212j  (U \u2297 V )>\n= O\u039bO>, with O = U \u2297 V , and \u039b = \u2211N j=1D\nj\u22121 \u2297 EN\u2212j . As W1 \u2208 Rd1\u00d7d0 , and WN \u2208 RdN\u00d7dN\u22121 , then D \u2208 Rd0\u00d7d0 , E \u2208 RdN\u00d7dN , so \u039b \u2208 Rd0dN\u00d7d0dN . Moreover note that \u039b D0\u2297EN\u22121 +DN\u22121\u2297 E0 = Id0 \u2297 EN\u22121 + DN\u22121 \u2297 IdN . If \u03bbD denotes the minimum diagonal element of D and \u03bbE denotes the minimum diagonal element of E, then the minimum diagonal element of \u039b is therefore at least \u03bbN\u22121D + \u03bb N\u22121 E . 
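The vectorization step above rests on the standard identity vec(A X B) = (B^T ⊗ A) vec(X) for the column-stacking vec operator. The following is a minimal numerical sketch of that step, not part of the original proof: the layer sizes, the random matrices, and all helper names (`matmul`, `matpow`, `kron`, `vec`, `add`) are illustrative assumptions, and `G` merely stands in for dL/dW (W(t)).

```python
import random

def matmul(A, B):
    """Plain-Python product of A (m x n) and B (n x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matpow(A, p):
    """A**p for a square matrix A and an integer p >= 0."""
    n = len(A)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(p):
        R = matmul(R, A)
    return R

def kron(A, B):
    """Kronecker product of A and B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def vec(X):
    """Column-stacking vectorization of X."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N, d0, d1, d_prev, dN = 3, 2, 4, 4, 3   # hypothetical depth and layer widths
W1 = rand_matrix(d1, d0)                # first layer, d1 x d0
WN = rand_matrix(dN, d_prev)            # last layer, dN x d_{N-1}
G = rand_matrix(dN, d0)                 # stand-in for dL/dW (W(t))

P = matmul(transpose(W1), W1)           # W1^T W1, d0 x d0
Q = matmul(WN, transpose(WN))           # WN WN^T, dN x dN

lhs_matrix = [[0.0] * d0 for _ in range(dN)]
K = [[0.0] * (d0 * dN) for _ in range(d0 * dN)]
for j in range(1, N + 1):
    term = matmul(matmul(matpow(Q, N - j), G), matpow(P, j - 1))
    lhs_matrix = add(lhs_matrix, term)
    K = add(K, kron(matpow(P, j - 1), matpow(Q, N - j)))

lhs = vec(lhs_matrix)
g = vec(G)
rhs = [sum(K[r][c] * g[c] for c in range(len(g))) for r in range(len(K))]
max_diff = max(abs(x - y) for x, y in zip(lhs, rhs))
print(max_diff)  # difference between the two sides, up to rounding
```

Because W1^T W1 is symmetric, no extra transpose is needed on the first Kronecker factor, which is why the operator has the form used in the proof.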
But, it follows from Lemma 5 (with \u03bd = 2\u03b4) that\nmax{\u03bbND , \u03bbNE } \u2265 \u03c32min \u2212 3\n2 2\u03b4M2(N\u22121)N2 \u2265 3\u03c32min/4,\nwhere the second inequality follows from (28). Hence the minimum diagonal element of \u039b is at least (\u03c32min/(4/3))\n(N\u22121)/N \u2265 \u03c32(N\u22121)/Nmin /(4/3). It follows as a result of the above inequalities that if we writeE(t) = vec(W (t+1))\u2212vec(W (t))+ \u03b7(O\u039bO>)vec ( dL dW (W (t)) ) , then\n\u2016E(t)\u20162 = \u2225\u2225\u2225\u2225vec(W (t+ 1))\u2212 vec(W (t)) + \u03b7(O\u039bO>)vec( dLdW (W (t)) )\u2225\u2225\u2225\u2225 2\n\u2264 \u03b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F \u00b7 (2\u03b7M3N\u22124N2B + 2\u03b4N3M2(N\u22122)).\nThen we have\nL(W (t+ 1))\u2212 L(W (t)) \u2264 vec ( d\ndW L(W (t))\n)> vec (W (t+ 1)\u2212W (t)) + 1\n2 \u2016W (t+ 1)\u2212W (t)\u20162F\n= \u03b7 ( \u2212vec ( d\ndW L(W (t))\n)> (O\u039bO>)vec ( d\ndW L(W (t))\n) + 1\n\u03b7 vec\n( d\ndW L(W (t))\n)> E(t) )\n+ 1\n2 \u2016W (t+ 1)\u2212W (t)\u20162F\n\u2264 \u03b7 ( \u2212 \u2225\u2225\u2225\u2225 ddW L(W (t)) \u2225\u2225\u2225\u22252 F \u00b7 \u03c3 2(N\u22121)/N min 4/3 + \u2225\u2225\u2225\u2225 ddW L(W (t)) \u2225\u2225\u2225\u22252 F \u00b7 ( 2\u03b7M3N\u22124N2B + 2\u03b4N3M2(N\u22122) ))\n+ 1\n2 \u2016W (t+ 1)\u2212W (t)\u20162F ,\nwhere the first inequality follows since L(W ) = 12\u2016W \u2212 \u03a6\u2016 2 F is 1-smooth as a function of W .\nNext, by (29) and (30),\n\u2016W (t+ 1)\u2212W (t)\u20162F \u2264 2\u03b72 \u00b7 ( NM2(N\u22121) \u00b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F )2 + 2\u03b72 \u00b7 (2\u03b7M3N\u22124N2B)2 \u00b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F\n= 2\u03b72 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F \u00b7 ( N2M4(N\u22121) + (4\u03b72M6N\u22128N4B2) ) . 
(31)\nThus\nL(W (t+ 1))\u2212 L(W (t)) \u2264 \u03b7 \u00b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F \u00b7 ( \u2212\u03c3 2(N\u22121)/N min 4/3 + 2\u03b7M3N\u22124N2B + 2\u03b4N3M2(N\u22122)\n+\u03b7 \u00b7 (N2M4(N\u22121) + 4\u03b72M6N\u22128N4B2) ) .\nBy (27, 28), which bound \u03b7, 2\u03b4, respectively, we have that\nL(W (t+ 1))\u2212 L(W (t)) \u2264 \u03b7 \u00b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F \u00b7 ( \u2212\u03c3 2(N\u22121)/N min 4/3 + \u03c3 2(N\u22121)/N min 24 + \u03c3 2(N\u22121)/N min 8 + \u03c3 2(N\u22121)/N min 24 + \u03c3 2(N\u22121)/N min 24 )\n= \u22121 2 \u03c3 2(N\u22121)/N min \u03b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u22252 F . (32) D.2.3 PROOF OF LEMMA 2\nProof of Lemma 2. We use induction on t, beginning with the base case t = 0. Since the weights W1(0), . . . ,WN (0) are \u03b4-balanced, we get that A(0) holds automatically. To establish B(0), note that sinceW1:N (0) has deficiency margin c > 0 with respect to \u03a6, we must have \u2016W1:N (0)\u2212\u03a6\u2016F \u2264 \u03c3min(\u03a6) \u2264 \u2016\u03a6\u2016F , meaning that L1(W1:N (0)) \u2264 12\u2016\u03a6\u2016 2 F .\nFinally, by B(0), which gives \u2016W (0)\u2212 \u03a6\u2016F \u2264 \u2016\u03a6\u2016F , we have that\n\u2016W (0)\u2016\u03c3 \u2264 \u2016W (0)\u2016F \u2264 \u2016W (0)\u2212 \u03a6\u2016F + \u2016\u03a6\u2016F \u2264 2\u2016\u03a6\u2016F . (33)\nTo show that the above implies C(0), we use condition A(0) and Lemma 6 with C = 2\u2016\u03a6\u2016F and \u03bd = 2\u03b4. By the definition of \u03b4 in Theorem 1 and since c \u2264 \u2016\u03a6\u2016F , we have that\n2\u03b4 \u2264 c 2 128 \u00b7N3 \u00b7 \u2016\u03a6\u20162(N\u22121)/NF = \u2016\u03a6\u20162/NF 128N3 \u00b7 c 2 \u2016\u03a6\u20162F < \u2016\u03a6\u20162/NF 30N2 , (34)\nas required by Lemma 6. As A(0) and (33) verify the preconditions 1. 
and 2., respectively, of Lemma 6, it follows that for 1 \u2264 j \u2264 N , \u2016Wj(t)\u2016\u03c3 \u2264 (2\u2016\u03a6\u2016F )1/N \u00b7 21/(2N) < (4\u2016\u03a6\u2016F )1/N , verifying C(0) and completing the proof of the base case. The proof of Lemma 2 follows directly from the following inductive claims.\n1. A(t),B(t), C(t) \u21d2 B(t + 1). To prove this, we use Lemma 7. We verify first that the preconditions hold. First, C(t) immediately gives condition 1. of Lemma 7. By B(t), we have that \u2016W (t) \u2212 \u03a6\u2016\u03c3 \u2264 \u2016W (t) \u2212 \u03a6\u2016F \u2264 \u2016\u03a6\u2016F , giving condition 2. of Lemma 7. A(t) immediately gives condition 3. of Lemma 7. Finally, by B(t), we have that LN (W1(t), . . . ,WN (t)) \u2264 LN (W1(0), . . . ,WN (0)), so \u03c3min(W1:N (t)) \u2265 c by Claim 1. This verifies condition 4. of Lemma 7. Then Lemma 7 gives that LN (W1(t + 1), . . . ,WN (t + 1)) \u2264 LN (W1(t), . . . ,WN (t)) \u2212 1 2\u03c3min(W (t)) 2(N\u22121)/N\u03b7 \u2225\u2225 dL dW (W (t)) \u2225\u22252 F , establishing B(t+ 1).\n2. A(0),A\u2032(1), . . . ,A\u2032(t),A(t),B(0), . . . ,B(t), C(t) \u21d2 A(t + 1),A\u2032(t + 1). To prove this, note that for 1 \u2264 j \u2264 N \u2212 1,\nW>j+1(t+ 1)Wj+1(t+ 1)\u2212Wj(t+ 1)W>j (t+ 1)\n= ( W>j+1(t)\u2212 \u03b7W1:j(t) dL\ndW (W (t))>Wj+2:N (t) ) \u00b7 ( Wj+1(t)\u2212 \u03b7W>j+2:N (t) dL\ndW (W (t))W>1:j(t) ) \u2212 ( Wj(t)\u2212 \u03b7W>j+1:N (t) dL\ndW (W (t))W>1:j\u22121(t) ) \u00b7 ( W>j (t)\u2212 \u03b7W1:j\u22121(t) dL\ndW (W (t))>Wj+1:N (t)\n) .\nBy B(0), . . . ,B(t), \u2016W1:N (t) \u2212 \u03a6\u2016F \u2264 \u2016\u03a6\u2016F . By the triangle inequality it then follows that \u2016W1:N (t)\u2016\u03c3 \u2264 2\u2016\u03a6\u2016F . Also A(t) gives that for 1 \u2264 j \u2264 N \u2212 1, \u2016Wj(t)W>j (t) \u2212 W>j+1(t)Wj+1(t)\u2016F \u2264 2\u03b4. 
By Lemma 6 with C = 2\u2016\u03a6\u2016F , \u03bd = 2\u03b4 (so that (34) is satisfied),\u2225\u2225W>j+1(t+ 1)Wj+1(t+ 1)\u2212Wj(t+ 1)W>j (t+ 1)\u2225\u2225F \u2264 \u2016W>j+1(t)Wj+1(t)\u2212Wj(t)W>j (t)\u2016F + \u03b72 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F \u00b7 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 \u03c3\n\u00b7 ( \u2016Wj+2:N (t)\u20162\u03c3\u2016W1:j(t)\u20162\u03c3 + \u2016W1:j\u22121\u20162\u03c3\u2016Wj+1:N\u20162\u03c3 ) \u2264 \u2016W>j+1(t)Wj+1(t)\u2212Wj(t)W>j (t)\u2016F\n+4\u03b72 \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 F \u2225\u2225\u2225\u2225 dLdW (W (t)) \u2225\u2225\u2225\u2225 \u03c3 (2\u2016\u03a6\u2016F )2(N\u22121)/N . (35)\nIn the first inequality above, we have also used the fact that for matricesA,B such thatAB is defined, \u2016AB\u2016F \u2264 \u2016A\u2016\u03c3\u2016B\u2016F . (35) gives us A\u2032(t+ 1).\nWe next establish A(t + 1). By B(i) for 0 \u2264 i \u2264 t, we have that \u2225\u2225 dL dW (W (i)) \u2225\u2225 F = \u2016W \u2212 \u03a6\u2016F \u2264 \u2016\u03a6\u2016F . Using A\u2032(i) for 0 \u2264 i \u2264 t and summing over i gives\n\u2016W>j+1(t+ 1)Wj+1(t+ 1)\u2212Wj(t+ 1)W>j (t+ 1)\u2016F \u2264 \u2016W>j+1(0)Wj+1(0)\u2212Wj(0)W>j (0)\u2016F\n+4(2\u2016\u03a6\u2016F )2(N\u22121)/N \u00b7 \u03b72 t\u2211 i=0 \u2225\u2225\u2225\u2225 dLdW (W (i)) \u2225\u2225\u2225\u22252 F . (36)\nNext, by B(0), . . . ,B(t), we have that L(W (i)) \u2264 L(W (0)) for i \u2264 t. Since W (0) has deficiency margin of c and by Claim 1, it then follows that \u03c3min(W (i)) \u2265 c for all i \u2264 t.\nTherefore, by summing B(0), . . . 
,B(t),\n1 2 c2(N\u22121)/N\u03b7 t\u2211 i=0 \u2225\u2225\u2225\u2225 dLdW W (i) \u2225\u2225\u2225\u22252 F\n\u2264 1 2 \u03b7 t\u2211 i=0 \u03c3min(W (i)) 2(N\u22121)/N \u2225\u2225\u2225\u2225 dLdW (W (i)) \u2225\u2225\u2225\u22252 F\n\u2264 L(W (0))\u2212 L(W (t))\n\u2264 L(W (0)) \u2264 1 2 \u2016\u03a6\u20162F .\nTherefore,\n4 (2\u2016\u03a6\u2016F )2(N\u22121)/N \u03b72 t\u2211 i=0 \u2225\u2225\u2225\u2225 dLdW W (i) \u2225\u2225\u2225\u22252 F\n\u2264 16\u2016\u03a6\u20162(N\u22121)/NF \u03b7 \u2016\u03a6\u20162F\nc2(N\u22121)/N\n\u2264 16\u2016\u03a6\u20162(N\u22121)/NF \u00b7 1 3 \u00b7 211 \u00b7N3 \u00b7 c\n(4N\u22122)/N\n\u2016\u03a6\u2016(6N\u22124)/NF \u00b7 \u2016\u03a6\u2016 2 F c2(N\u22121)/N (37)\n\u2264 c 2\n256N3\u2016\u03a6\u20162(N\u22121)/NF = \u03b4,\nwhere (37) follows from the definition of \u03b7 in (7), and the last equality follows from definition of \u03b4 in Theorem 1. By (36), it follows that\n\u2016W>j+1(t+ 1)Wj+1(t+ 1)\u2212Wj(t+ 1)W>j (t+ 1)\u2016F \u2264 2\u03b4,\nverifying A(t+ 1).\n3. A(t),B(t) \u21d2 C(t). We apply Lemma 6 with \u03bd = 2\u03b4 and C = 2\u2016\u03a6\u2016F . First, the triangle inequality and B(t) give\n\u2016W1:N (t)\u2016\u03c3 \u2264 \u2016\u03a6\u2016\u03c3 + \u2016\u03a6\u2212W1:N (t)\u2016\u03c3 \u2264 \u2016\u03a6\u2016F + \u221a\n2 \u00b7 L(W1:N (t)) \u2264 2\u2016\u03a6\u2016F , verifying precondition 2. of Lemma 6. A(t) verifies condition 1. of Lemma 6, so for 1 \u2264 j \u2264 N , \u2016Wj(t)\u2016\u03c3 \u2264 (4\u2016\u03a6\u2016F )1/N , giving C(t).\nThe proof of Lemma 2 then follows by induction on t. D.3 PROOF OF THEOREM 2\nTheorem 2 is proven by combining Lemma 8 below, which implies that the balanced initialization is likely to lead to an end-to-end matrix W1:N (0) with sufficiently large deficiency margin, with Theorem 1, which establishes convergence. Lemma 8. Let d \u2208 N, d \u2265 20; b2 > b1 \u2265 1 be real numbers (possibly depending on d); and \u03a6 \u2208 Rd be a vector. 
Suppose that \u00b5 is a rotation-invariant distribution (see footnote 19) over R^d with a well-defined density, such that, for some 0 < \u03b5 < 1,\nP_{V\u223c\u00b5}[ \u2016\u03a6\u2016_2/\u221a(b_2 d) \u2264 \u2016V\u2016_2 \u2264 \u2016\u03a6\u2016_2/\u221a(b_1 d) ] \u2265 1 \u2212 \u03b5.\nThen, with probability at least (1 \u2212 \u03b5) \u00b7 (3 \u2212 4F(2/\u221a(b_1)))/2, V will have deficiency margin \u2016\u03a6\u2016_2/(b_2 d) with respect to \u03a6.\nFootnote 19: Recall that a distribution on vectors V \u2208 R^d is rotation-invariant if the distribution of V is the same as the distribution of OV, for any orthogonal d \u00d7 d matrix O. If V has a well-defined density, this is equivalent to the statement that for any r > 0, the distribution of V conditioned on \u2016V\u2016_2 = r is uniform over the sphere centered at the origin with radius r.\nThe proof of Lemma 8 is postponed to Appendix D.5, where Lemma 8 will be restated as Lemma 16.\nOne additional technique is used in the proof of Theorem 2, which leads to an improvement in the guaranteed convergence rate. Because the deficiency margin of W_{1:N}(0) is very small, namely O(\u2016\u03a6\u2016_2/d_0) (which is necessary for the theorem to maintain constant probability), at the beginning of optimization, \u2113(t) will decrease very slowly. However, after a certain amount of time, the deficiency margin of W_{1:N}(t) will increase to a constant, at which point the decrease of \u2113(t) will be much faster. To capture this acceleration, we apply Theorem 1 a second time, using the larger deficiency margin at the new \u201cinitialization.\u201d From a geometric perspective, we note that the matrices W_1(0), . . . , W_N(0) are very close to 0, and the point at which W_j(0) = 0 for all j is a saddle. Thus, the increase in \u2113(t) \u2212 \u2113(t+1) over time captures the fact that the iterates (W_1(t), . . . , W_N(t)) escape a saddle point.\nProof of Theorem 2. Choose some a \u2265 2, to be specified later. 
By assumption, all entries of the end-to-end matrix at time 0, W_{1:N}(0), are distributed as independent Gaussians of mean 0 and standard deviation s \u2264 \u2016\u03a6\u2016_2/\u221a(a d_0^2). We will apply Lemma 8 to the vector W_{1:N}(0) \u2208 R^{d_0}. Since its distribution is rotation-invariant, it remains to show that the distribution of the norm \u2016W_{1:N}(0)\u2016_2 is not too spread out. The following lemma, a direct consequence of the Chernoff bound applied to the \u03c7^2 distribution with d_0 degrees of freedom, gives the desired result:\nLemma 9 (Laurent and Massart (2000), Lemma 1). Suppose that d \u2208 N and V \u2208 R^d is a vector whose entries are i.i.d. Gaussians with mean 0 and standard deviation s. Then, for any k > 0,\nP[ \u2016V\u2016_2^2 \u2265 s^2 (d + 2k + 2\u221a(kd)) ] \u2264 exp(\u2212k),\nP[ \u2016V\u2016_2^2 \u2264 s^2 (d \u2212 2\u221a(kd)) ] \u2264 exp(\u2212k).\nBy Lemma 9 with k = d_0/16, we have that P[ s^2 d_0/2 \u2264 \u2016V\u2016_2^2 \u2264 2 s^2 d_0 ] \u2265 1 \u2212 2 exp(\u2212d_0/16).\nWe next use Lemma 8, with b_1 = \u2016\u03a6\u2016_2^2/(2 s^2 d_0^2), b_2 = 2\u2016\u03a6\u2016_2^2/(s^2 d_0^2); note that since a \u2265 2, b_1 \u2265 1, as required by the lemma. Lemma 8 then implies that with probability at least\n(1 \u2212 2 exp(\u2212d_0/16)) \u00b7 (3 \u2212 4F(2/\u221a(a/2)))/2, (38)\nW_{1:N}(0) will have deficiency margin s^2 d_0/(2\u2016\u03a6\u2016_2) with respect to \u03a6. By the definition of balanced initialization (Procedure 1), W_1(0), . . . , W_N(0) are 0-balanced. Since 2^4 \u00b7 6144 < 10^5, our assumption on \u03b7 gives\n\u03b7 \u2264 (s^2 d_0)^{4\u22122/N} / (2^4 \u00b7 6144 N^3 \u2016\u03a6\u2016_2^{10\u22126/N}), (39)\nso that Equation (7) holds with c = s^2 d_0/(2\u2016\u03a6\u2016_2). The conditions of Theorem 1 thus hold with probability at least that given in Equation (38). 
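To get a feel for the size of the probability in Equation (38), it can be evaluated numerically. The sketch below assumes that F denotes the standard normal cumulative distribution function (consistent with the estimate F(x) ≥ 1 − exp(−x²/2) used later in this appendix); the function names and the sample values of d_0 and a are illustrative choices, not taken from the paper.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def success_probability(d0, a):
    """Evaluate the lower bound (38) for input dimension d0 and parameter a >= 2."""
    norm_term = 1.0 - 2.0 * math.exp(-d0 / 16.0)      # from Lemma 9 with k = d0/16
    cap_term = (3.0 - 4.0 * std_normal_cdf(2.0 / math.sqrt(a / 2.0))) / 2.0
    return norm_term * cap_term

p_moderate = success_probability(100, 100)       # moderate d0 and a
p_large = success_probability(10**4, 10**8)      # both parameters very large
print(p_moderate, p_large)
```

Under these assumptions the bound exceeds 0.25 at d_0 = a = 100 and approaches, but never reaches, 1/2 as both parameters grow.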
In such a constant probability event, by Theorem 1 (and the fact that a positive deficiency margin implies L1(W_{1:N}(0)) \u2264 (1/2)\u2016\u03a6\u2016_2^2), if we choose\nt_0 \u2265 \u03b7^{\u22121} (2\u2016\u03a6\u2016_2/(s^2 d_0))^{2\u22122/N} ln(4), (40)\nthen L1(W_{1:N}(t_0)) \u2264 (1/8)\u2016\u03a6\u2016_2^2, meaning that \u2016W_{1:N}(t_0) \u2212 \u03a6\u2016_2 \u2264 (1/2)\u2016\u03a6\u2016_2 = \u2016\u03a6\u2016_2 \u2212 (1/2)\u03c3_min(\u03a6). Moreover, by condition A(t_0) of Lemma 2 and the definition of \u03b4 in Theorem 1, we have, for 1 \u2264 j \u2264 N \u2212 1,\n\u2016W_{j+1}^T(t_0) W_{j+1}(t_0) \u2212 W_j(t_0) W_j^T(t_0)\u2016_F \u2264 2 s^4 d_0^2 / ((2\u2016\u03a6\u2016_2)^2 \u00b7 256 N^3 \u2016\u03a6\u2016_2^{2\u22122/N}) = s^4 d_0^2 / (512 N^3 \u2016\u03a6\u2016_2^{4\u22122/N}). (41)\nWe now apply Theorem 1 again, verifying its conditions again, this time with the initialization (W_1(t_0), . . . , W_N(t_0)). First note that the end-to-end matrix W_{1:N}(t_0) has deficiency margin c = \u2016\u03a6\u2016_2/2 as shown above. The learning rate \u03b7, by Equation (39), satisfies Equation (7) with c = \u2016\u03a6\u2016_2/2. Finally, since\ns^4 d_0^2 / (512 N^3 \u2016\u03a6\u2016_2^{4\u22122/N}) \u2264 \u2016\u03a6\u2016_2^{2/N} / ((a^2 d_0^2) \u00b7 512 N^3) \u2264 \u2016\u03a6\u2016_2^{2/N} (1/2)^2 / (256 N^3)\nfor d_0 \u2265 2, by Equation (41), the matrices W_1(t_0), . . . , W_N(t_0) are \u03b4-balanced with \u03b4 = \u2016\u03a6\u2016_2^{2/N} (1/2)^2 / (256 N^3). Iteration t_0 thus satisfies the conditions of Theorem 1 with deficiency margin \u2016\u03a6\u2016_2/2, meaning that for\nT \u2212 t_0 \u2265 \u03b7^{\u22121} \u00b7 2^{2\u22122/N} \u00b7 \u2016\u03a6\u2016_2^{2/N\u22122} ln(\u2016\u03a6\u2016_2^2/(8\u03b5)), (42)\nwe will have \u2113(T) \u2264 \u03b5. Therefore, by Equations (40) and (42), to ensure that \u2113(T) \u2264 \u03b5, we may take\nT \u2265 4\u03b7^{\u22121} ( ln(4) (\u2016\u03a6\u2016_2/(s^2 d_0))^{2\u22122/N} + \u2016\u03a6\u2016_2^{2/N\u22122} ln(\u2016\u03a6\u2016_2^2/(8\u03b5)) ).\nRecall that this entire analysis holds only with the probability given in Equation (38). 
As lim_{d\u2192\u221e} (1 \u2212 2 exp(\u2212d/16)) = 1 and lim_{a\u2192\u221e} (3 \u2212 4F(2\u221a(2/a)))/2 = 1/2, for any 0 < p < 1/2, there exist a, d\u2032_0 > 0 such that for d_0 \u2265 d\u2032_0, the probability given in Equation (38) is at least p. This completes the proof.\nIn the context of the above proof, we remark that the expressions 1 \u2212 2 exp(\u2212d_0/16) and (3 \u2212 4F(2\u221a(2/a)))/2 converge quite quickly to their limits of 1 and 1/2, respectively, as d_0, a \u2192 \u221e. For instance, to obtain a probability greater than 0.25 of the initialization conditions being met, we may take d_0 \u2265 100, a \u2265 100.\nD.4 PROOF OF CLAIM 2\nWe first consider the probability of \u03b4-balancedness holding between any two layers:\nLemma 10. Suppose a, b, d \u2208 N and A \u2208 R^{a\u00d7d}, B \u2208 R^{d\u00d7b} are matrices whose entries are distributed as i.i.d. Gaussians with mean 0 and standard deviation s. Then for k \u2265 1,\nP[ \u2016A^T A \u2212 B B^T\u2016_F \u2265 k s^2 \u221a(2d(a+b)^2 + d^2(a+b)) ] \u2264 1/k^2. (43)\nProof. For 1 \u2264 i, j \u2264 d, let X_{ij} be the random variable (A^T A \u2212 B B^T)_{ij}, so that\nX_{ij} = \u2211_{1\u2264\u2113\u2264a} A_{\u2113i} A_{\u2113j} \u2212 \u2211_{1\u2264r\u2264b} B_{ir} B_{jr}.\nIf i \u2260 j, then E[X_{ij}^2] = \u2211_{1\u2264\u2113\u2264a} E[A_{\u2113i}^2 A_{\u2113j}^2] + \u2211_{1\u2264r\u2264b} E[B_{ir}^2 B_{jr}^2] = (a+b) s^4.\nWe next note that for a normal random variable Y of variance s^2 and mean 0, E[Y^4] = 3s^4. Then if i = j,\nE[X_{ii}^2] = s^4 \u00b7 (3(a+b) + a(a\u22121) + b(b\u22121) \u2212 2ab) \u2264 s^4((a+b)^2 + 2(a+b)). Thus\nE[\u2016A^T A \u2212 B B^T\u2016_F^2] \u2264 s^4 (d((a+b)^2 + 2(a+b)) + d(d\u22121)(a+b)) \u2264 s^4 (2d(a+b)^2 + d^2(a+b)).\nThen (43) follows from Markov\u2019s inequality applied to \u2016A^T A \u2212 B B^T\u2016_F^2.\nNow the proof of Claim 2 follows from a simple union bound:\nProof of Claim 2. 
By (43) of Lemma 10, for each 1 \u2264 j \u2264 N \u2212 1 and k \u2265 1, P[ \u2016W_{j+1}^T W_{j+1} \u2212 W_j W_j^T\u2016_F \u2265 k s^2 \u221a(10 d_max^3) ] \u2264 1/k^2.\nBy the union bound, P[ \u2200 1 \u2264 j \u2264 N \u2212 1, \u2016W_{j+1}^T W_{j+1} \u2212 W_j W_j^T\u2016_F \u2264 k s^2 \u221a(10 d_max^3) ] \u2265 1 \u2212 N/k^2,\nand the claim follows with \u03b4 = k s^2 \u221a(10 d_max^3).\nD.5 PROOF OF CLAIM 3\nWe begin by introducing some notation. Given d \u2208 N and r > 0, we let Bd(r) denote the open ball of radius r centered at the origin in R^d. For an open subset U \u2282 R^d, let \u2202U := U\u0304\\U be its boundary, where U\u0304 denotes the closure of U. For the special case of U = Bd(r), we will denote by Sd(r) the boundary of such a ball, i.e. the sphere of radius r centered at the origin in R^d. Let Sd := Sd(1) and Bd := Bd(1). There is a well-defined uniform (Haar) measure on Sd(r) for all d, r, which we denote by \u03c3d,r; we assume \u03c3d,r is normalized so that \u03c3d,r(Sd(r)) = 1. Finally, since in the context of this claim we have d_N = 1, we allow ourselves to regard the end-to-end matrix W_{1:N} \u2208 R^{1\u00d7d_0} as both a matrix and a vector.\nTo establish Claim 3, we will use the following low-degree anti-concentration result of Carbery and Wright (2001) (see also Lovett (2010); Meka et al. (2016)):\nLemma 11 (Carbery and Wright (2001)). There is an absolute constant C_0 such that the following holds. Suppose that h is a multilinear polynomial of K variables X_1, . . . , X_K and of degree N. Suppose that X_1, . . . , X_K are i.i.d. Gaussian. Then, for any \u03b5 > 0:\nP[ |h(X_1, . . . , X_K)| \u2264 \u03b5 \u00b7 \u221a(Var[h(X_1, . . . , X_K)]) ] \u2264 C_0 N \u03b5^{1/N}.\nThe lemma below characterizes the norm of the end-to-end matrix W_{1:N} following zero-centered Gaussian initialization:\nLemma 12. For any constant 0 < C_2 < 1, there is an absolute constant C_1 > 0 such that the following holds. Let N, d_0, . . . , d_{N\u22121} \u2208 N. Set d_N = 1. 
Suppose that for 1 \u2264 j \u2264 N, W_j \u2208 R^{d_j\u00d7d_{j\u22121}} are matrices whose entries are i.i.d. Gaussians of standard deviation s and mean 0. Then\nP[ s^{2N} d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} (1/(C_1 N))^{2N} \u2264 \u2016W_{1:N}\u2016_2^2 \u2264 C_1 d_0^2 d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} s^{2N} ] \u2265 C_2.\nProof. Let f(W_1, . . . , W_N) = \u2016W_{1:N}\u2016_2^2, so that f is a polynomial of degree 2N in the entries of W_1, . . . , W_N. Notice that\nf(W_1, . . . , W_N) = \u2211_{i_0=1}^{d_0} ( \u2211_{i_1=1}^{d_1} \u00b7 \u00b7 \u00b7 \u2211_{i_{N\u22121}=1}^{d_{N\u22121}} (W_N)_{1,i_{N\u22121}} (W_{N\u22121})_{i_{N\u22121},i_{N\u22122}} \u00b7 \u00b7 \u00b7 (W_1)_{i_1,i_0} )^2.\nFor 1 \u2264 i_0 \u2264 d_0, set\ng_{i_0}(W_1, . . . , W_N) = \u2211_{i_1=1}^{d_1} \u00b7 \u00b7 \u00b7 \u2211_{i_{N\u22121}=1}^{d_{N\u22121}} (W_N)_{1,i_{N\u22121}} (W_{N\u22121})_{i_{N\u22121},i_{N\u22122}} \u00b7 \u00b7 \u00b7 (W_1)_{i_1,i_0},\nso that f = \u2211_{i_0=1}^{d_0} g_{i_0}^2. Since each g_{i_0} is a multilinear polynomial in W_1, . . . , W_N, we have that E[g_{i_0}(W_1, . . . , W_N)] = 0 for all 1 \u2264 i_0 \u2264 d_0. Also\nVar[g_{i_0}(W_1, . . . , W_N)] = E[g_{i_0}(W_1, . . . , W_N)^2]\n= \u2211_{i_1=1}^{d_1} \u00b7 \u00b7 \u00b7 \u2211_{i_{N\u22121}=1}^{d_{N\u22121}} E[ (W_N)_{1,i_{N\u22121}}^2 (W_{N\u22121})_{i_{N\u22121},i_{N\u22122}}^2 \u00b7 \u00b7 \u00b7 (W_1)_{i_1,i_0}^2 ] = d_1 d_2 \u00b7 \u00b7 \u00b7 d_{N\u22121} s^{2N}.\nIt then follows by Markov\u2019s inequality that for any k \u2265 1, P[g_{i_0}^2 \u2265 k s^{2N} d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121}] \u2264 1/k. For any constant B_1 (whose exact value will be specified below), it follows that\nP[ f(W_1, . . . , W_N) \u2265 B_1 d_0^2 d_1 d_2 \u00b7 \u00b7 \u00b7 d_{N\u22121} s^{2N} ]\n= P[ \u2211_{i_0=1}^{d_0} g_{i_0}(W_1, . . . , W_N)^2 \u2265 B_1 d_0^2 d_1 d_2 \u00b7 \u00b7 \u00b7 d_{N\u22121} s^{2N} ]\n\u2264 d_0 \u00b7 P[ g_1(W_1, . . . , W_N)^2 \u2265 B_1 d_0 d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} s^{2N} ] \u2264 1/B_1. (44)\nNext, by Lemma 11, there is an absolute constant C_0 > 0 such that for any \u03b5 > 0, and any 1 \u2264 i_0 \u2264 d_0,\nP[ |g_{i_0}(W_1, . . . , W_N)| \u2264 \u03b5^N \u221a(s^{2N} d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121}) ] \u2264 C_0 N \u03b5.\nSince f \u2265 g_{i_0}^2 for each i_0, it follows that\nP[ f(W_1, . . . , W_N) \u2265 \u03b5^{2N} s^{2N} d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} ] \u2265 1 \u2212 C_0 N \u03b5. (45)\nNext, given 0 < C_2 < 1, choose \u03b5 = (1 \u2212 C_2)/(2 C_0 N), and B_1 = 2/(1 \u2212 C_2). Then by (44) and (45) and a union bound, we have that\nP[ ((1 \u2212 C_2)/(2 C_0 N))^{2N} s^{2N} d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} \u2264 f(W_1, . . . , W_N) \u2264 (2/(1 \u2212 C_2)) s^{2N} d_0^2 d_1 \u00b7 \u00b7 \u00b7 d_{N\u22121} ] \u2265 C_2.\nThe result of the lemma then follows by taking C_1 = max{ 2/(1 \u2212 C_2), 2 C_0/(1 \u2212 C_2) }.\nLemma 13. Let N, d_0, . . . , d_{N\u22121} \u2208 N, and set d_N = 1. Suppose W_j \u2208 R^{d_j\u00d7d_{j\u22121}} for 1 \u2264 j \u2264 N are matrices whose entries are i.i.d. Gaussians with mean 0 and standard deviation s. Then, the distribution of W_{1:N} is rotation-invariant.\nProof. First we remark that for any orthogonal matrix O \u2208 R^{d_0\u00d7d_0}, the distribution of W_1 is the same as that of W_1 O. To see this, let us denote the rows of W_1 by (W_1)_1, . . . , (W_1)_{d_1}, and the columns of O by O_1, . . . , O_{d_0}. Then the (i_1, i_0) entry of W_1 O, for 1 \u2264 i_1 \u2264 d_1, 1 \u2264 i_0 \u2264 d_0, is \u3008(W_1)_{i_1}, O_{i_0}\u3009, which is a Gaussian with mean 0 and standard deviation s, since \u2016O_{i_0}\u2016_2 = 1. Since \u3008O_{i_0}, O_{i\u2032_0}\u3009 = 0 for i_0 \u2260 i\u2032_0, the covariance between any two distinct entries of W_1 O is 0. Therefore, the entries of W_1 O are independent Gaussians with mean 0 and standard deviation s, just as are the entries of W_1.\nBut now for any orthogonal matrix O \u2208 R^{d_0\u00d7d_0}, the distribution of W_{1:N} O is the distribution of W_N W_{N\u22121} \u00b7 \u00b7 \u00b7 W_2 (W_1 O), which is the same as the distribution of W_N W_{N\u22121} \u00b7 \u00b7 \u00b7 W_2 W_1 = W_{1:N}, since W_1, W_2, . . . , W_N are all independent.\nFor a dimension d \u2208 N, radius r > 0, and 0 < h < r, a (d, r)-hyperspherical cap of height h is a subset C \u2282 Bd(r) of the form {x \u2208 Bd(r) : \u3008x, u\u3009 \u2265 r \u2212 h}, where u is any d-dimensional unit vector. 
We define the area of a (d, r)-hyperspherical cap of height h\u2014 C\u2014 to be \u03c3d,r(\u2202C\u2229Sd(r)). Lemma 14. For d \u2265 20, choose any 0 \u2264 h \u2264 1. Then, the area of a (d, 1)-hyperspherical cap of height h is at least\n3\u2212 4F ((1\u2212 h) \u221a d\u2212 3)\n2 .\nProof. In Chudnov (1986), it is shown that the area of a (d, 1)-hyperspherical cap of height h is given by 1\u2212Cd\u22122(h)/Cd\u22122(0)2 , where\nCd(h) := \u222b 1\u2212h 0 (1\u2212 t2)(d\u22121)/2dt.\nNext, by the inequality 1\u2212 t2 \u2265 exp(\u22122t2) for 0 \u2264 t \u2264 1/2,\u222b 1 0 (1\u2212 t2)(d\u22123)/2dt \u2265 \u222b 1/2 0 exp ( 2 \u00b7 \u2212t 2(d\u2212 3) 2 ) dt\n= \u221a \u03c0/(d\u2212 3) \u00b7\n2F ( \u221a\n(d\u2212 3)/2)\u2212 1 2\n\u2265 \u221a \u03c0/(d\u2212 3) \u00b7 1\u2212 2 exp(\u2212(d\u2212 3)/4)\n2 , (46)\nwhere the last inequality follows from the standard estimate F (x) \u2265 1 \u2212 exp(\u2212x2/2) for x \u2265 1. Also, since 1\u2212 t2 \u2264 exp(\u2212t2) for all t,\u222b 1\u2212h\n0\n(1\u2212 t2)(d\u22123)/2dt \u2264 \u222b 1\u2212h\n0\nexp\n( \u2212t2(d\u2212 3)\n2\n) dt\n= \u221a 2\u03c0/(d\u2212 3) \u00b7 2F ((1\u2212 h) \u221a d\u2212 3)\u2212 1\n2 . (47)\nTherefore, for d \u2265 20, by (46) and (47),\n1\u2212 Cd\u22122(h)/Cd\u22122(0) 2 \u2265 1\u2212\n\u221a 2\u00b7(2F ((1\u2212h) \u221a d\u22123)\u22121)\n1\u22122 exp(\u2212(d\u22123)/4)\n2\n\u2265 1\u2212 \u221a 2 \u00b7 (2F ((1\u2212 h) \u221a d\u2212 3)\u2212 1) \u00b7 (1 + 4 exp(\u2212(d\u2212 3)/4))\n2\n\u2265 3\u2212 4F ((1\u2212 h) \u221a d\u2212 3)\n2 ,\nwhere the second inequality has used 1/(1 \u2212 y) \u2264 1 + 2y for all 0 < y < 1/2 (and where y = 2 exp((\u2212(d\u2212 3)/4)) < 2 exp(\u221217/4) < 1/2), and the final inequality uses 1 + 4 exp(\u2212(d\u2212 3)/4) \u2264 \u221a 2 for d \u2265 20. The above chain of inequalities gives us the desired result.\nLemma 15. 
Let d \u2208 N, d \u2265 20; a \u2265 1 be a real number (possibly depending on d); and \u03a6 \u2208 Rd be some vector. Set r = \u2016\u03a6\u20162/ \u221a ad, and suppose that V \u2208 Sd(r) is drawn according to the uniform measure. Then, with probability at least 3\u22124F (2/ \u221a a)\n2 , V will have deficiency margin \u2016\u03a6\u20162/(ad) with respect to \u03a6.\nProof. By rescaling, we may assume without loss of generality that \u2016\u03a6\u20162 = 1, so that r = 1/ \u221a ad. Let D denote the intersection of Bd(r) with the open d-ball of radius 1\u2212 1/(ad) centered at \u03a6. Let C \u2282 Bd(r) denote the (d, r)-hyperspherical cap of height r \u00b7 ( 1 \u2212 2/( \u221a ad) )\n= r \u2212 2/(ad) whose base is orthogonal to the line between 0 and \u03a6 (see Figure 2). Note that \u03c3d,r(\u2202D\u2229Sd(r)), the Haar measure of the portion of \u2202D intersecting Sd(r), gives the probability that V belongs to the boundary of D. By Lemma 14 above (along with rescaling arguments), since d \u2265 20, \u03c3d,r(\u2202C \u2229 Sd(r)) \u2265 1 2 \u00b7 (3\u2212 4F (2/ \u221a a)), and therefore V \u2208 \u2202C with at least this probability.\nWe next claim that C \u2286 D. To see this, first let T \u2282 Rd denote the (d \u2212 1)-sphere of radius 1 \u2212 1/(ad) centered at \u03a6 (see Figure 2). Let P be the intersection of T with the line from 0 to \u03a6, and Q denote the intersection of this line with the unique hyperplane of codimension 1 containing T \u2229 \u2202Bd(r) \u2014 we denote this hyperplane by H. If we can show that \u2016P \u2212Q\u20162 \u2264 1/(ad), then it follows that C lies entirely on the other side ofH as 0, which will complete the proof that C \u2286 D. 
The calculation of \u2016P \u2212Q\u20162 is simply an application of the law of cosines: letting \u03b8 be the angle determining the intersection of \u2202Bd(r) and T (see Figure 2), note that\n(1\u2212 1/(ad))2 = r2 + 12 \u2212 2r cos \u03b8 = 1/(ad) + 1\u2212 2/ \u221a ad \u00b7 cos(\u03b8),\nso d(P,Q) = r cos \u03b8 \u2212 1/(ad) = 1\n2 (1/(ad)\u2212 1/(a2d2)) < 1/(ad),\nas desired.\nUsing that C \u2286 D, we continue with the proof. Notice the fact that C \u2286 D is equivalent to \u2202C \u2229 Sd(r) \u2286 \u2202D \u2229 Sd(r), by the structure of C and D. Since the probability that V lands in \u2202C is at least 3\u22124F (2/ \u221a a)\n2 , this lower bound applies to V landing in \u2202D as well. Since all V \u2208 \u2202D have distance at most 1 \u2212 1/(ad) from \u03a6, and since \u03c3min(\u03a6) = \u2016\u03a6\u20162 = 1, it follows that for any V \u2208 \u2202D, \u2016V \u2212 \u03a6\u20162 \u2264 \u03c3min(\u03a6) \u2212 1/(ad). Therefore, with probability of at least 3\u22124F (2/ \u221a a)\n2 , V has deficiency margin \u2016\u03a6\u20162/(ad) with respect to \u03a6.\nLemma 16 (Lemma 8 restated). Let d \u2208 N, d \u2265 20; b2 > b1 \u2265 1 be real numbers (possibly depending on d); and \u03a6 \u2208 Rd be a vector. Suppose that \u00b5 is a rotation-invariant distribution over Rd with a well-defined density, such that, for some 0 < < 1,\nPV\u223c\u00b5 [ \u2016\u03a6\u20162\u221a b2d \u2264 \u2016V \u20162 \u2264 \u2016\u03a6\u20162\u221a b1d ] \u2265 1\u2212 .\nThen, with probability at least (1 \u2212 ) \u00b7 3\u22124F (2/ \u221a b1)\n2 , V will have deficiency margin \u2016\u03a6\u20162/(b2d) with respect to \u03a6.\nProof. By rescaling we may assume that \u2016\u03a6\u20162 = 1 without loss of generality. Then the deficiency margin of V is equal to 1 \u2212 \u2016V \u2212 \u03a6\u20162. \u00b5 has a well-defined density, so we can set \u00b5\u0302 to be the\nprobability density function of \u2016V \u20162. 
Since \u00b5 is rotation-invariant, we can integrate over spherical coordinates, giving\n$$\\begin{aligned} \\mathbb{P}\\left[1 - \\|V - \\Phi\\|_2 \\geq \\tfrac{1}{b_2 d}\\right] &= \\int_0^\\infty \\mathbb{P}\\left[1 - \\|V - \\Phi\\|_2 \\geq \\tfrac{1}{b_2 d} \\,\\middle|\\, \\|V\\|_2 = r\\right] \\hat{\\mu}(r)\\,dr \\\\ &\\geq \\int_{1/\\sqrt{b_2 d}}^{1/\\sqrt{b_1 d}} \\frac{3 - 4F(2r\\sqrt{d})}{2}\\, \\hat{\\mu}(r)\\,dr \\\\ &\\geq \\frac{3 - 4F(2/\\sqrt{b_1})}{2} \\cdot \\int_{1/\\sqrt{b_2 d}}^{1/\\sqrt{b_1 d}} \\hat{\\mu}(r)\\,dr \\\\ &\\geq \\frac{3 - 4F(2/\\sqrt{b_1})}{2} \\cdot (1 - \\epsilon), \\end{aligned}$$\nwhere the first inequality used Lemma 15 and the fact that the distribution of $V$ conditioned on $\\|V\\|_2 = r$ is uniform on $S^d(r)$.\nNow we are ready to prove Claim 3:\nProof of Claim 3. We let $W \\in \\mathbb{R}^{1 \\times d_0} \\simeq \\mathbb{R}^{d_0}$ denote the random vector $W_{1:N}$; also let \u00b5 denote the distribution of $W$, so that by Lemma 13, \u00b5 is rotation-invariant. Let $C_1$ be the constant from Lemma 12 for $C_2 = 999/1000$. For some $a \\geq 10^5$, the standard deviation of the entries of each $W_j$ is given by\n$$s = \\left( \\frac{\\|\\Phi\\|_2^2}{a d_0^3 d_1 \\cdots d_{N-1} C_1} \\right)^{1/(2N)}. \\quad (48)$$\nThen by Lemma 12,\n$$\\mathbb{P}\\left[ \\frac{\\|\\Phi\\|_2^2}{a d_0^3 C_1} \\cdot \\left( \\frac{1}{C_1 N} \\right)^{2N} \\leq \\|W\\|_2^2 \\leq \\frac{\\|\\Phi\\|_2^2}{a d_0} \\right] \\geq \\frac{999}{1000}.$$\nThen Lemma 16, with $d = d_0$, $b_1 = a$ and $b_2 = a d_0^2 C_1 \\cdot (C_1 N)^{2N}$, implies that with probability at least $\\frac{999}{1000} \\cdot \\frac{3 - 4F(2/\\sqrt{a})}{2}$, $W$ has deficiency margin $\\|\\Phi\\|_2 / (a d_0^3 C_1^{2N+1} N^{2N})$ with respect to \u03a6. But $a \\geq 10^5$ implies that this probability is at least 0.49, and from (48),\n$$\\frac{\\|\\Phi\\|_2}{a d_0^3 C_1^{2N+1} N^{2N}} = \\frac{s^{2N} d_1 \\cdots d_{N-1}}{\\|\\Phi\\|_2 (C_1 N)^{2N}}. \\quad (49)$$\nNext recall the assumption in the hypothesis that $s \\geq C_1 N (c \\cdot \\|\\Phi\\|_2 / (d_1 \\cdots d_{N-1}))^{1/(2N)}$. 
Then the deficiency margin in (49) is at least(\nC1N(c \u2016\u03a6\u20162 /(d1 \u00b7 \u00b7 \u00b7 dN\u22121))1/(2N) )2N\nd1 \u00b7 \u00b7 \u00b7 dN\u22121 \u2016\u03a6\u20162(C1N)2N = c,\ncompleting the proof. D.6 PROOF OF CLAIM 4\nProof. The target matrices \u03a6 that will be used to prove the claim satisfy \u03c3min(\u03a6) = 1. We may assume without loss of generality that c \u2265 3/4, the reason being that if a matrix has deficiency margin c with respect to \u03a6 and c\u2032 < c, it certainly has deficiency margin c\u2032 with respect to \u03a6.\nWe first consider the case d = 1, so that the target and all matrices are simply real numbers; we will make a slight abuse of notation in identifying 1 \u00d7 1 matrices with their unique entries. We set \u03a6 = 1. For all choices of \u03b7, we will set the initializations W1(0), . . . ,WN (0) so that W1:N (0) = c. Then\n\u2016W1:N (0)\u2212 \u03a6\u2016F = |W1:N (0)\u2212 \u03a6| = 1\u2212 c = \u03c3min(\u03a6)\u2212 c,\nso the initial end-to-end matrix W1:N (0) \u2208 R1\u00d71 has deficiency margin c. Now fix \u03b7. Choose A \u2208 R with\nA = max {\u221a \u03b7N,\n2\n\u03b7(1\u2212 c)c(N\u22121)/N , 2000, 20/\u03b7,\n( 20 \u00b7 102N\u22121\n\u03b72N\n)1/(2N\u22122)} . (50)\nWe will set:\nWj(0) = { Ac1/N : 1 \u2264 j \u2264 N/2 c1/N/A : N/2 < j \u2264 N, (51)\nso that W1:N (0) = c. Then since LN (W1, . . . ,WN ) = 12 (1 \u2212WN \u00b7 \u00b7 \u00b7W1) 2, the gradient descent updates are given by\nWj(t+ 1) = Wj(t)\u2212 \u03b7(W1:N (t)\u2212 1) \u00b7W1:j\u22121(t)Wj+1:N (t), where we view W1(t), . . . ,WN (t) as real numbers. This gives\nWj(1) = { c1/NA\u2212 \u03b7(c\u2212 1)c(N\u22121)/N/A : 1 \u2264 j \u2264 N/2 c1/N/A\u2212 \u03b7(c\u2212 1)c(N\u22121)/NA : N/2 < j \u2264 N.\nSince 3/4 \u2264 c < 1 and \u2212\u03b7(c \u2212 1)c(N\u22121)/NA \u2265 0, we have that A/2 \u2264 3A/4 \u2264 Wj(1) for 1 \u2264 j \u2264 N/2. 
Next, since 1\u2212c\n1\u2212c1/N \u2264 N for 0 \u2264 c < 1, we have that A 2 \u2265 \u03b7N \u2265 \u03b7(1\u2212c) 1\u2212c1/N , which\nimplies thatA2 \u2265 c1/NA2+\u03b7(1\u2212c), or c1/NA+ \u03b7(1\u2212c)A \u2264 A. ThusWj(1) \u2264 A forN/2 < j \u2264 N . Similarly, using the same bound 3/4 \u2264 c < 1 and the fact that \u03b7(1 \u2212 c)c(N\u22121)/NA \u2265 2 we get 3 16\u03b7A \u2264 Wj(1) \u2264 \u03b7A for N/2 < j \u2264 N . In particular, for all 1 \u2264 j \u2264 N , we have that min{\u03b7,1}\n10 A \u2264Wj(1) \u2264 max{\u03b7, 1}A. We prove the following lemma by induction:\nLemma 17. For each t \u2265 1, the real numbers W1(t), . . . ,WN (t) all have the same sign and this sign alternates for each integer t. Moreover, there are real numbers 2 \u2264 B(t) < C(t) for t \u2265 1 such that for 1 \u2264 j \u2264 N , B(t) \u2264 |Wj(t)| \u2264 C(t) and \u03b7B(t)2N\u22121 \u2265 20C(t).\nProof. First we claim that we may take B(1) = min{\u03b7,1}10 A and C(1) = max{\u03b7, 1}A. We have shown above that B(1) \u2264 Wj(1) \u2264 C(1) for all j. Next we establish that \u03b7B(1)2N\u22121 \u2265 20C(1). If \u03b7 \u2264 1, then\n\u03b7B(1)2N\u22121 = \u03b72N \u00b7 (A/10)2N\u22121 \u2265 20A = 20C(1), where the inequality follows from A \u2265 (\n20\u00b7102N\u22121 \u03b72N\n)1/(2N\u22122) by definition of A. If \u03b7 \u2265 1, then\n\u03b7B(1)2N\u22121 = \u03b7(A/10)2N\u22121 \u2265 20\u03b7A = 20C(1), where the inequality follows from A \u2265 2000 \u2265 ( 20 \u00b7 102N\u22121 )1/(2N\u22122) by definition of A.\nNow, suppose the statement of Lemma 17 holds for some t. Suppose first that Wj(t) are all positive for 1 \u2264 j \u2264 N . Then for all j, as B(t) \u2265 2, and \u03b7B(t)2N\u22121 \u2265 20C(t),\nWj(t+ 1) \u2264 C(t)\u2212 \u03b7 \u00b7 (B(t)N \u2212 1) \u00b7B(t)N\u22121\n\u2264 C(t)\u2212 \u03b7 2 B(t)2N\u22121\n\u2264 \u22129C(t), which establishes that Wj(t+ 1) is negative for all j. 
Moreover,\nWj(t+ 1) \u2265 \u2212\u03b7(C(t)N \u2212 1) \u00b7 C(t)N\u22121\n\u2265 \u2212\u03b7C(t)2N\u22121. Now set B(t+ 1) = 9C(t) and C(t+ 1) = \u03b7C(t)2N\u22121. Since N \u2265 2, we have that\n\u03b7B(t+ 1)2N\u22121 = \u03b7(9C(t))2N\u22121 \u2265 \u03b793C(t)2N\u22121 > 20\u03b7C(t)2N\u22121 = 20C(t+ 1). The case that all Wj(t) are negative for 1 \u2264 j \u2264 N is nearly identical, with the same values for B(t + 1), C(t + 1) in terms of B(t), C(t), except all Wj(t + 1) will be positive. This establishes the inductive step and completes the proof of Lemma 17.\nBy Lemma 17, we have that for all t \u2265 1, LN (W1(t), . . . ,WN (t)) = 12 (W1:N (t)\u2212 1) 2 \u2265 12 (2 N \u2212 1)2 > 0, thus completing the proof of Claim 4 for the case where all dimensions are equal to 1.\nFor the general case where d0 = d1 = \u00b7 \u00b7 \u00b7 = dN = d for some d \u2265 1, we set \u03a6 = Id, and given c, \u03b7, we setWj(0) to be the d\u00d7d diagonal matrix where all diagonal entries except the first one are equal to 1, and where the first diagonal entry is given by Equation (51), whereA is given by Equation (50). It is easily verified that all entries of Wj(t), 1 \u2264 j \u2264 N , except for the first diagonal element of each matrix, will remain constant for all t \u2265 0, and that the first diagonal elements evolve exactly as in the 1-dimensional case presented above. Therefore the loss in the d-dimensional case is equal to the loss in the 1-dimensional case, which is always greater than some positive constant.\nWe remark that the proof of Claim 4 establishes that the loss `(t) := LN (W1(t), . . . ,WN (t)) grows at least exponentially in t for the chosen initialization. Such behavior, in which gradients and weights explode, indeed takes place in deep learning practice if initialization is not chosen with care. D.7 PROOF OF CLAIM 5\nProof. 
We will show that a target matrix $\\Phi \\in \\mathbb{R}^{d \\times d}$ which is symmetric with at least one negative eigenvalue, along with identity initialization ($W_j(0) = I_d$, $\\forall j \\in \\{1, \\ldots, N\\}$), satisfy the conditions of the claim. First, note that non-stationarity of initialization is met, as for any $1 \\leq j \\leq N$,\n$$\\frac{\\partial L^N(W_1(0), \\ldots, W_N(0))}{\\partial W_j(0)} = W_{j+1:N}(0)^\\top (W_{1:N}(0) - \\Phi)\\, W_{1:j-1}(0)^\\top = I_d - \\Phi \\neq 0,$$\nwhere the last inequality follows since \u03a6 has a negative eigenvalue. To analyze gradient descent we use the following result, which was established in Bartlett et al. (2018):\nLemma 18 (Bartlett et al. (2018), Lemma 6). If $W_1(0), \\ldots, W_N(0)$ are all initialized to identity, \u03a6 is symmetric, $\\Phi = U D U^\\top$ is a diagonalization of \u03a6, and gradient descent is performed with any learning rate, then for each $t \\geq 0$ there is a diagonal matrix $\\hat{D}(t)$ such that $W_j(t) = U \\hat{D}(t) U^\\top$ for each $1 \\leq j \\leq N$.\nBy Lemma 18, for any choice of learning rate \u03b7, the end-to-end matrix at time $t$ is given by $W_{1:N}(t) = U \\hat{D}(t)^N U^\\top$. As long as some diagonal element of $D$ is negative, say equal to $-\\lambda < 0$, then\n$$\\ell(t) = L^N(W_1(t), \\ldots, W_N(t)) = \\frac{1}{2} \\|W_{1:N}(t) - \\Phi\\|_F^2 = \\frac{1}{2} \\|\\hat{D}(t)^N - D\\|_F^2 \\geq \\frac{1}{2} \\lambda^2 > 0.$$\nE IMPLEMENTATION DETAILS\nBelow we provide implementation details omitted from our experimental report (Section 4).\nThe platform used for running the experiments is PyTorch (Paszke et al., 2017). For compliance with our analysis, we applied PCA whitening to the numeric regression dataset from UCI Machine Learning Repository. That is, all instances in the dataset were preprocessed by an affine operator that ensured zero mean and identity covariance matrix. 
Subsequently, we rescaled labels such that the uncentered cross-covariance matrix $\\Lambda_{yx}$ (see Section 2) has unit Frobenius norm (this has no effect on optimization other than calibrating learning rate and standard deviation of initialization to their conventional ranges). With the training objective taking the form of Equation (1), we then computed c (the global optimum) in accordance with the formula derived in Appendix A. In our experiments with linear neural networks, balanced initialization was implemented with the assignment written in step (iii) of Procedure 1. In the non-linear network experiment, we added, for each $j \\in \\{1, \\ldots, N-1\\}$, a random orthogonal matrix to the right of $W_j$, and its transpose to the left of $W_{j+1}$; this assignment maintains the properties required from balanced initialization (see Footnote 7). During all experiments, whenever we applied grid search over learning rate, values between $10^{-4}$ and 1 (in regular logarithmic intervals) were tried.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Uncertainty in deep learning , Authors: Yarin Gal , 2016\n2.  Title: Bayesian compression for deep learning , Authors: Christos Louizos Karen Ullrich... , 2017\n3.  Title: Variational dropout sparsifies deep neural networks , Authors: Dmitry Molchanov Arsenii Ashukha... , 2017\n4.  Title: Structured bayesian pruning via log-normal multiplicative noise , Authors: Kirill Neklyudov Dmitry Molchanov... , 2017\n5.  Title: Soft weight-sharing for neural network compression , Authors: Karen Ullrich Edward Meeds... , 2017\n6.  Title: Trained ternary quantization , Authors: Chenzhuo Zhu Song Han... , 2016\n7.  Title: 2017) presented an approximation for the KL divergence under a (zero-centered) log uniform prior (Eq. (5)). Since our quantizing prior is essentially a composition of shifted log uniform priors, we construct a composition of the approximation given by Molchanov et al , Authors: Molchanov , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["In this paper, the preparation of a neural network for pruning and few-bit quantization is formulated as a variational inference problem. To this end, a quantizing prior that leads to a multi-modal, sparse posterior distribution over weights, is introduced and a differentiable Kullback-Leibler divergence approximation for this prior is derived. After training with Variational Network Quantization, weights can be replaced by deterministic quantization values with small to negligible loss of task accuracy (including pruning by setting weights to 0). The method does not require fine-tuning after quantization. 
Results are shown for ternary quantization on LeNet-5 (MNIST) and DenseNet (CIFAR-10).", "Our method is an extension of Sparse VD (Molchanov et al., 2017), originally used for network pruning. In contrast, we use a quantizing prior, leading to a multi-modal posterior suitable for few-bit quantization and pruning. Bayesian Compression and Structured Bayesian Pruning (Louizos et al., 2017; Neklyudov et al., 2017) extend Sparse VD to prune whole neurons or filters via group-sparsity constraints. Additionally, in Bayesian Compression the required bit-precision per layer is determined via the posterior variance. In contrast to our method, Bayesian Compression does not explicitly enforce clustering of weights during training and thus requires bit-widths in the range between 5 and 18 bits. Extending our method to include group-constraints for pruning is an interesting direction for future work. Another Bayesian method for simultaneous network quantization and pruning is soft weight-sharing (SWS) (Ullrich et al., 2017), which uses a Gaussian mixture model prior (and a KL term without trainable parameters such that the KL term reduces to the prior entropy). SWS acts like a probabilistic version of k-means clustering with the advantage of automatic collapse of unnecessary mixture components. Similar to learning the codebooks in our method, soft weight-sharing learns the prior from the data, a technique known as empirical Bayes. We cannot directly compare against soft weight-sharing since the authors do not report results on ternary networks. Gal et al. (2017) learn dropout rates by using a continuous relaxation of dropout\u2019s discrete masks (via the concrete distribution). The authors learn layer-wise dropout rates, which does not allow for dropout-rate-based pruning. We experimented with using the concrete distribution for learning codebooks for quantization with promising early results but so far we have observed lower pruning rates or lower accuracy compared to VNQ. 
A non-probabilistic state-of-the-art method for network ternarization is Trained Ternary Quantization (Zhu et al., 2016) which uses full-precision shadow weights during training, but quantized forward passes.", "This paper presented a straightforward extension of the work done at [1, 2] for ternary networks through a multimodal quantising prior. It is generally well-written, with extensive preliminaries and clear equations. The visualizations also serve as a nice way to convey the behaviour of the proposed approach. The idea is interesting and well executed so I propose for acceptance. I only have a couple of minor questions: \n- For the KL-divergence approximation you report a maximum difference of 1 nat per weight that seems a bit high; did you experiment with the `naive` Monte Carlo approximation of the bound (e.g. as done at Bayes By Backprop) during optimization? If yes, was there a big difference in performance?\n- Was pre-training necessary to obtain the current results for MNIST? As far as I know, [1] and [2] did not need pre-training for the MNIST results (but did employ pre-training for CIFAR 10).\n- How necessary was each one of the constraints during optimization (and what did they prevent)? \n- Did you ever observe posterior means that do not settle at one of the prior modes but rather stay in between? Or did you ever have issues of the variance growing large enough, so that q(w) captures multiple modes of the prior (maybe the constraints prevent this)? How sensitive is the quantisation scheme?\n\nOther minor comments / typos:\n(1) 7th line of section 2.1 page 2, \u2018a unstructured data\u2019 -> \u2018unstructured data\u2019\n(2) 5th line on page 3, remove \u2018compare Eq. (1)\u2019 (or rephrase it appropriately).\n(3) Section 2.2, \u2019Kullback-Leibler divergence between the true and the approximate posterior\u2019; between implies symmetry (and the KL isn\u2019t symmetric) so I suggest to change it to e.g. 
\u2018from the true to the approximate posterior\u2019 to avoid confusion. Same for the first line of Section 3.3.\n(4) Footnote 2, the distribution of the noise depends on the random variable so I would suggest to change it to a general \\epsilon \\sim p(\\epsilon).\n(5) Equation 4 is confusing.\n\n7: Good paper, accept"], "input": "1 INTRODUCTION\nParameters of a trained neural network commonly exhibit high degrees of redundancy (Denil et al., 2013) which implies an over-parametrization of the network. Network compression methods implicitly or explicitly aim at the systematic reduction of redundancy in neural network models while at the same time retaining a high level of task accuracy. Besides architectural approaches, such as SqueezeNet (Iandola et al., 2016) or MobileNets (Howard et al., 2017), many compression methods perform some form of pruning or quantization. Pruning is the removal of irrelevant units (weights, neurons or convolutional filters) (LeCun et al., 1990). Relevance of weights is often determined by the absolute value (\u201cmagnitude based pruning\u201d (Han et al., 2016; 2017; Guo et al., 2016)), but more sophisticated methods have been known for decades, e.g., based on second-order derivatives (Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi & Stork, 1993)) or ARD (automatic relevance determination, a Bayesian framework for determining the relevance of weights, (MacKay, 1995; Neal, 1995; Karaletsos & Ra\u0308tsch, 2015)). Quantization is the reduction of the bit-precision of weights, activations or even gradients, which is particularly desirable from a hardware perspective (Sze et al., 2017). Methods range from fixed bit-width computation (e.g., 12-bit fixed point) to aggressive quantization such as binarization of weights and activations (Courbariaux et al., 2016; Rastegari et al., 2016; Zhou et al., 2016; Hubara et al., 2016). 
Few-bit quantization (2 to 6 bits) is often performed by k-means clustering of trained weights with subsequent fine-tuning of the cluster centers (Han et al., 2016). Pruning and quantization methods have been shown to work well in conjunction (Han et al., 2016). In so-called \u201cternary\u201d networks, weights can have one out of three possible values (negative, zero or positive), which also allows for simultaneous pruning and few-bit quantization (Li et al., 2016; Zhu et al., 2016).\nThis work is closely related to some recent Bayesian methods for network compression (Ullrich et al., 2017; Molchanov et al., 2017; Louizos et al., 2017; Neklyudov et al., 2017) that learn a posterior distribution over network weights under a sparsity-inducing prior. The posterior distribution over network parameters allows identifying redundancies through three means: (1) weights with an expected value very close to zero and (2) weights with a large variance can be pruned, as they do not contribute much to the overall computation; (3) the posterior variance over non-pruned parameters can be used to determine the required bit-precision (quantization noise can be made as large as implied by the posterior uncertainty). Additionally, Bayesian inference over model parameters is known to automatically reduce parameter redundancy by penalizing overly complex models (MacKay, 2003).\nIn this paper we present Variational Network Quantization (VNQ), a Bayesian network compression method for simultaneous pruning and few-bit quantization of weights. We extend previous Bayesian pruning methods by introducing a multi-modal quantizing prior that penalizes weights of low variance unless they lie close to one of the target values for quantization. As a result, weights are either drawn to one of the quantization target values or they are assigned large variance values (see Fig. 1). 
After training, our method yields a Bayesian neural network with a multi-modal posterior over weights (typically with one mode fixed at 0), which is the basis for subsequent pruning and quantization. Additionally, posterior uncertainties can be interesting for network introspection and analysis, as well as for obtaining uncertainty estimates over network predictions (Gal & Ghahramani, 2015; Gal, 2016; Depeweg et al., 2016; 2017). After pruning and hard quantization, and without the need for additional fine-tuning, our method yields a deterministic feed-forward neural network with heavily quantized weights. Our method is applicable to pre-trained networks but can also be used for training from scratch. Target values for quantization can either be manually fixed or they can be learned during training. We demonstrate our method for the case of ternary quantization on LeNet-5 (MNIST) and DenseNet (CIFAR-10).\n2 PRELIMINARIES\nOur method extends recent work that uses a (variational) Bayesian objective for neural network pruning (Molchanov et al., 2017). In this section, we first motivate such an approach by discussing that the objectives of compression (in the minimum-description-length sense) and Bayesian inference are well-aligned. We then briefly review the core ingredients that are combined in Sparse Variational Dropout (Molchanov et al., 2017). The final idea (and also the starting point of our method) is to learn dropout noise levels per weight and prune weights with large dropout noise. 
Learning dropout noise per weight can be done by interpreting dropout training as variational inference of an approximate weight-posterior under a sparsity-inducing prior; this is known as Variational Dropout, which is described in more detail below, after a brief introduction to modern approximate posterior inference in Bayesian neural networks by optimizing the evidence lower bound via stochastic gradient ascent and reparameterization tricks.\n1 Kernel density estimate, with radial basis function kernels with a bandwidth of 0.05.\n2.1 WHY BAYES FOR COMPRESSION?\nBayesian inference over model parameters automatically penalizes overly complex parametric models, leading to an automatic regularization effect (Gr\u00fcnwald, 2007; Graves, 2011) (see Molchanov et al. (2017), where the authors show that Sparse Variational Dropout (Sparse VD) successfully prevents a network from fitting unstructured data, that is, a random labeling). The automatic regularization is based on the objective of maximizing model evidence, also known as marginal likelihood. A very complex model might have a particular parameter setting that achieves extremely good likelihood given the data; however, since the model evidence is obtained via marginalizing parameters, overly complex models are penalized for having many parameter settings with poor likelihood. This effect is also known as \u201cBayesian Occam\u2019s Razor\u201d in Bayesian model selection (MacKay, 2003; Genewein & Braun, 2014). The argument can be extended to variational Bayesian inference (with some caveats) via the equivalence of the variational Bayesian objective and the minimum description length (MDL) principle (Rissanen, 1978; Gr\u00fcnwald, 2007; Graves, 2011; Louizos et al., 2017). 
The evidence lower bound (ELBO), which is maximized in variational inference, is composed of two terms: $L_E$, the average message length required to transmit outputs (labels) to a receiver that knows the inputs and the posterior over model parameters, and $L_C$, the average message length to transmit the posterior parameters to a receiver that knows the prior over parameters:\n$$L_{\\mathrm{ELBO}} = \\underbrace{-L_E}_{\\text{neg. reconstr. error}} + \\underbrace{-L_C}_{\\text{neg. KL divergence} \\,=\\, \\text{entropy} - \\text{cross entropy}}.$$\nMaximizing the ELBO minimizes the total message length: $\\max L_{\\mathrm{ELBO}} \\Leftrightarrow \\min (L_E + L_C)$, leading to an optimal trade-off between short description length of the data and the model (thus, minimizing the sum of error cost $L_E$ and model complexity cost $L_C$). Interestingly, MDL dictates the use of stochastic models since they are in general \u201cmore compressible\u201d compared to deterministic models: high posterior uncertainty over parameters is rewarded by the entropy term in $L_C$, since higher uncertainty allows the quantization noise to be higher, thus requiring lower bit-precision for a parameter. Variational Bayesian inference can also be formally related to the information-theoretic framework for lossy compression, rate-distortion theory (Cover & Thomas, 2006; Tishby et al., 2000; Genewein et al., 2015). The only difference is that rate-distortion requires the use of the optimal prior, which is the marginal over posteriors (Hoffman & Johnson, 2016; Tomczak & Welling, 2017; Hoffman et al., 2017), providing an interesting connection to empirical Bayes where the prior is learned from the data.\n2.2 VARIATIONAL BAYES AND REPARAMETERIZATION\nLet D be a dataset of N pairs $(x_n, y_n)_{n=1}^N$ and $p(y|x,w)$ be a parameterized model that predicts outputs y given inputs x and parameters w. A Bayesian neural network models a (posterior) distribution over parameters w instead of just a point-estimate. 
The posterior is given by Bayes\u2019 rule: $p(w|D) = p(D|w)p(w)/p(D)$, where $p(w)$ is the prior over parameters. Computation of the true posterior is in general intractable. Common approaches to approximate inference in neural networks are for instance: MCMC methods pioneered in (Neal, 1995) and later refined, e.g., via stochastic gradient Langevin dynamics (Welling & Teh, 2011), or variational approximations to the true posterior (Graves, 2011), Bayes by Backprop (Blundell et al., 2015), Expectation Backpropagation (Soudry et al., 2014), Probabilistic Backpropagation (Hern\u00e1ndez-Lobato & Adams, 2015). In the latter methods the true posterior is approximated by a parameterized distribution $q_\\phi(w)$. Variational parameters \u03c6 are optimized by minimizing the Kullback-Leibler (KL) divergence from the true to the approximate posterior $D_{\\mathrm{KL}}(q_\\phi(w) \\| p(w|D))$. Since computation of the true posterior is intractable, minimizing this KL divergence is approximately performed by maximizing the so-called \u201cevidence lower bound\u201d (ELBO) or \u201cnegative variational free energy\u201d (Kingma & Welling, 2014):\n$$L_{\\mathrm{ELBO}}(\\phi) = \\underbrace{\\sum_{n=1}^{N} \\mathbb{E}_{q_\\phi(w)}[\\log p(y_n | x_n, w)]}_{L_D(\\phi)} - D_{\\mathrm{KL}}(q_\\phi(w) \\| p(w)), \\quad (1)$$\n$$\\simeq L_{\\mathrm{SGVB}}(\\phi) = \\frac{N}{M} \\sum_{m=1}^{M} \\log p(\\tilde{y}_m | \\tilde{x}_m, f(\\phi, \\epsilon_m)) - D_{\\mathrm{KL}}(q_\\phi(w) \\| p(w)), \\quad (2)$$\nwhere we have used the Reparameterization Trick$^2$ (Kingma & Welling, 2014) in Eq. (2) to get an unbiased, differentiable, minibatch-based Monte Carlo estimator of the expected log likelihood $L_D(\\phi)$. A mini-batch of data is denoted by $(\\tilde{x}_m, \\tilde{y}_m)_{m=1}^M$. 
Additionally, and in line with similar work (Molchanov et al., 2017; Louizos et al., 2017; Neklyudov et al., 2017), we use the Local Reparameterization Trick (Kingma et al., 2015) to further reduce variance of the stochastic ELBO gradient estimator, which locally marginalizes weights at each layer and instead samples directly from the distribution over pre-activations (which can be computed analytically). See Appendix A.2 for more details on the Local Reparameterization Trick. Commonly, the prior $p(w)$ and the parametric form of the posterior $q_\\phi(w)$ are chosen such that the KL divergence term can be computed analytically (e.g. a fully factorized Gaussian prior and posterior, known as the mean-field approximation). Due to the particular choice of prior in our work, a closed-form expression for the KL divergence cannot be obtained but instead we use a differentiable approximation (see Sec. 3.3).\n2.3 VARIATIONAL INFERENCE VIA DROPOUT TRAINING\nDropout (Srivastava et al., 2014) is a method originally introduced for regularization of neural networks, where activations are stochastically dropped (i.e., set to zero) with a certain probability p during training. It was shown that dropout, i.e., multiplicative noise on inputs, is equivalent to having noisy weights and vice versa (Wang & Manning, 2013; Kingma et al., 2015). Multiplicative Gaussian noise $\\xi_{ij} \\sim \\mathcal{N}(1, \\alpha = \\frac{p}{1-p})$ on a weight $w_{ij}$ induces a Gaussian distribution\n$$w_{ij} = \\theta_{ij} \\xi_{ij} = \\theta_{ij}(1 + \\sqrt{\\alpha}\\, \\epsilon_{ij}) \\sim \\mathcal{N}(\\theta_{ij}, \\alpha \\theta_{ij}^2) \\quad (3)$$\nwith $\\epsilon_{ij} \\sim \\mathcal{N}(0, 1)$. In standard (Gaussian) dropout training, the dropout rates \u03b1 (or p to be precise) are fixed and the expected log likelihood $L_D(\\phi)$ (first term in Eq. (1)) is maximized with respect to the means \u03b8. Kingma et al. (2015) show that Gaussian dropout training is mathematically equivalent to maximizing the ELBO (both terms in Eq. 
(1)), under a prior $p(w)$ and fixed \u03b1 where the KL term does not depend on \u03b8:\n$$L(\\alpha, \\theta) = \\mathbb{E}_{q_\\alpha}[L_D(\\theta)] - D_{\\mathrm{KL}}(q_\\alpha(w) \\| p(w)), \\quad (4)$$\nwhere the dependencies on \u03b1 and \u03b8 of the terms in Eq. (1) have been made explicit. The only prior that meets this requirement is the scale-invariant log-uniform prior:\n$$p(\\log |w_{ij}|) = \\text{const.} \\Leftrightarrow p(|w_{ij}|) \\propto \\frac{1}{|w_{ij}|}. \\quad (5)$$\nUsing this interpretation, it becomes straightforward to learn individual dropout rates $\\alpha_{ij}$ per weight, by including $\\alpha_{ij}$ into the set of variational parameters $\\phi = (\\theta, \\alpha)$. This procedure was introduced in (Kingma et al., 2015) under the name \u201cVariational Dropout\u201d. With the choice of a log-uniform prior (Eq. (5)) and a factorized Gaussian approximate posterior $q_\\phi(w_{ij}) = \\mathcal{N}(\\theta_{ij}, \\alpha_{ij} \\theta_{ij}^2)$ (Eq. (3)) the KL term in Eq. (1) is not analytically tractable, but the authors of Kingma et al. (2015) present an approximation\n$$-D_{\\mathrm{KL}}(q_\\phi(w_{ij}) \\| p(w_{ij})) \\approx \\text{const.} + 0.5 \\log \\alpha_{ij} + c_1 \\alpha_{ij} + c_2 \\alpha_{ij}^2 + c_3 \\alpha_{ij}^3, \\quad (6)$$\nsee the original publication for numerical values of $c_1, c_2, c_3$. Note that due to the mean-field approximation, where the posterior over all weights factorizes into a product over individual weights $q_\\phi(w) = \\prod q_\\phi(w_{ij})$, the KL divergence factorizes into a sum of individual KL divergences\n$$D_{\\mathrm{KL}}(q_\\phi(w) \\| p(w)) = \\sum D_{\\mathrm{KL}}(q_\\phi(w_{ij}) \\| p(w_{ij})).$$\n2 The trick is to use a deterministic, differentiable (w.r.t. \u03c6) function $w = f(\\phi, \\epsilon)$ with $\\epsilon \\sim p(\\epsilon)$, instead of directly using $q_\\phi(w)$.\n2.4 PRUNING UNITS WITH LARGE DROPOUT RATES\nLearning dropout rates is interesting for network compression since neurons or weights with very high dropout rates $p \\to 1$ can very likely be pruned without loss in accuracy. However, as the authors of Sparse Variational Dropout (Sparse VD) (Molchanov et al., 2017) report, the approximation in Eq. 
(6) is only accurate for $\\alpha \\leq 1$ (corresponding to $p \\leq 0.5$). For this reason, the original variational dropout paper restricted \u03b1 to values smaller than or equal to 1, which are unsuitable for pruning. Molchanov et al. (2017) propose an improved approximation, which is very accurate on the full range of $\\log \\alpha$:\n$$-D_{\\mathrm{KL}}(q_\\phi(w_{ij}) \\| p(w_{ij})) \\approx \\text{const.} + k_1 S(k_2 + k_3 \\log \\alpha_{ij}) - 0.5 \\log(1 + \\alpha_{ij}^{-1}) = F_{\\mathrm{KL,LU}}(\\theta_{ij}, \\sigma_{ij}), \\quad (7)$$\nwith $k_1 = 0.63576$, $k_2 = 1.87320$ and $k_3 = 1.48695$ and $S$ denoting the sigmoid function. Additionally, the authors propose to use an additive, instead of a multiplicative, noise reparameterization, which significantly reduces variance in the gradient $\\partial L_{\\mathrm{SGVB}} / \\partial \\theta_{ij}$ for large $\\alpha_{ij}$. To achieve this, the multiplicative noise term is replaced by an exactly equivalent additive noise term $\\sigma_{ij} \\epsilon_{ij}$ with $\\sigma_{ij}^2 = \\alpha_{ij} \\theta_{ij}^2$ and the set of variational parameters becomes $\\phi = (\\theta, \\sigma)$:\n$$w_{ij} = \\theta_{ij} \\underbrace{(1 + \\sqrt{\\alpha}\\, \\epsilon_{ij})}_{\\text{mult. noise}} = \\theta_{ij} + \\underbrace{\\sigma_{ij} \\epsilon_{ij}}_{\\text{add. noise}} \\sim \\mathcal{N}(\\theta_{ij}, \\sigma_{ij}^2), \\quad \\epsilon_{ij} \\sim \\mathcal{N}(0, 1). \\quad (8)$$\nAfter Sparse VD training, pruning is performed by thresholding $\\alpha_{ij} = \\sigma_{ij}^2 / \\theta_{ij}^2$. In Molchanov et al. (2017) a threshold of $\\log \\alpha = 3$ is used, which roughly corresponds to $p > 0.95$. Pruning weights that lie above a threshold of $T_\\alpha$ leads to\n$$\\frac{\\sigma_{ij}^2}{\\theta_{ij}^2} \\geq T_\\alpha \\Leftrightarrow \\sigma_{ij}^2 \\geq T_\\alpha \\theta_{ij}^2, \\quad (9)$$\nwhich means effectively that weights with large variance but also weights of lower variance and a mean $\\theta_{ij}$ close to zero are pruned. A visualization of the pruning threshold can be seen in Fig. 1 (the \u201ccentral funnel\u201d, i.e., the area marked by the red dotted lines for a threshold of $T_\\alpha = 2$). Sparse VD training can be performed from random initialization or with pre-trained networks by initializing the means $\\theta_{ij}$ accordingly. 
In Bayesian Compression (Louizos et al., 2017) and Structured Bayesian Pruning (Neklyudov et al., 2017), Sparse VD has been extended to include group-sparsity constraints, which allows for pruning of whole neurons or convolutional filters (via learning their corresponding dropout rates). 2.5 SPARSITY INDUCING PRIORS\nFor pruning weights based on their (learned) dropout rate, it is desirable to have high dropout rates for most weights. Perhaps surprisingly, Variational Dropout already implicitly introduces such a \u201chigh dropout rate constraint\u201d via the implicit prior distribution over weights. The prior p(w) can be used to induce sparsity into the posterior by having high density at zero and heavy tails. There is a well known family of such distributions: scale-mixtures of normals (Andrews & Mallows, 1974; Louizos et al., 2017; Ingraham & Marks, 2017):\nw \u223c N (0, z2); z \u223c p(z), where the scales of w are random variables. A well-known example is the spike-and-slab prior (Mitchell & Beauchamp, 1988), which has a delta-spike at zero and a slab over the real line. Gal & Ghahramani (2015); Kingma et al. (2015) show how Dropout training implies a spike-and-slab prior over weights. The log uniform prior used in Sparse VD (Eq. (5)) can also be derived as a marginalized scale-mixture of normals\np(wij) \u221d \u222b 1\n|zij | N (wij |0, z2ij)dzij =\n1\n|wij | ; p(zij) \u221d\n1\n|zij | , (10)\nalso known as the normal-Jeffreys prior (Figueiredo, 2002). Louizos et al. (2017) discuss how the log-uniform prior can be seen as a continuous relaxation of the spike-and-slab prior and how the alternative formulation through the normal-Jeffreys distribution can be used to couple the scales of weights that belong together and thus, learn dropout rates for whole neurons or convolutional filters, which is the basis for Bayesian Compression (Louizos et al., 2017) and Structured Bayesian Pruning (Neklyudov et al., 2017). 
3 VARIATIONAL NETWORK QUANTIZATION\nWe formulate the preparation of a neural network for a post-training quantization step as a variational inference problem. To this end, we introduce a multi-modal, quantizing prior and train by maximizing the ELBO (Eq. (2)) under a mean-field approximation of the posterior (i.e., a fully factorized Gaussian). The goal of our algorithm is to achieve soft quantization, that is learning a posterior distribution such that the accuracy-loss introduced by post-training quantization is small. Our variational posterior approximation and training procedure is similar to Kingma et al. (2015) and Molchanov et al. (2017) with the crucial difference of using a quantizing prior that drives weights towards the target values for quantization. 3.1 A QUANTIZING PRIOR\nThe log uniform prior (Eq. (5)) can be viewed as a continuous relaxation of the spike-and-slab prior with a spike at location 0 (Louizos et al., 2017). We use this insight to formulate a quantizing prior, a continuous relaxation of a \u201cmulti-spike-and-slab\u201d prior which has multiple spikes at locations ck, k \u2208 {1, . . . ,K}. Each spike location corresponds to one target value for subsequent quantization. The quantizing prior allows weights of low variance only at the locations of the quantization target values ck. The effect of using such a quantizing prior during Variational Network Quantization is shown in Fig. 1. After training, most weights of low variance are distributed very closely around the quantization target values ck and can thus be replaced by the corresponding value without significant loss in accuracy. We typically fix one of the quantization targets to zero, e.g., c2 = 0, which allows pruning weights. Additionally, weights with a large variance can also be pruned. Both kinds of pruning can be achieved with an \u03b1ij threshold (see Eq. (9)) as in sparse Variational Dropout (Molchanov et al., 2017). 
Following the interpretation of the log uniform prior p(wij) as a marginal over the scale-hyperparameter zij , we extend Eq. (10) with a hyper-prior over locations\np(wij) = \u222b N (wij |mij , zij)pz(zij)pm(mij) dzijdmij pm(mij) = \u2211 k ak\u03b4(mij \u2212 ck), (11)\nwith p(zij) \u221d |zij |\u22121. The location prior pm(mij) is a mixture of weighted delta distributions located at the quantization values ck. Marginalizing over m yields the quantizing prior\np(wij) \u221d \u2211 k ak \u222b 1 |zij | N (wij |ck, zij) dzij = \u2211 k ak 1 |wij \u2212 ck| . (12)\nIn our experiments, we use K = 3, ak = 1/K \u2200k and c2 = 0 unless indicated otherwise. 3.2 POST-TRAINING QUANTIZATION\nEq. (9) implies that using a threshold on \u03b1ij as a pruning criterion is equivalent to pruning weights whose value does not differ significantly from zero:\n\u03b82ij \u2264 \u03c32ij T\u03b1 \u21d4 \u03b8ij \u2208 (\u2212 \u03c3ij\u221a T\u03b1 , \u03c3ij\u221a T\u03b1 ). (13)\nTo be precise, T\u03b1 specifies the width of a scaled standard-deviation band \u00b1\u03c3ij/ \u221a T\u03b1 around the mean \u03b8ij . If the value zero lies within this band, the weight is assigned the value 0. For instance, a pruning threshold which implies p \u2265 0.95 corresponds to a variance band of approximately \u03c3ij/4. An equivalent interpretation is that a weight is pruned if the likelihood for the value 0 under the approximate posterior exceeds the threshold given by the standard-deviation band (Eq. (13)):\nN (0|\u03b8ij , \u03c32ij) \u2265 N (\u03b8ij \u00b1 \u03c3ij\u221a T\u03b1 |\u03b8ij , \u03c32ij) = 1\u221a 2\u03c0\u03c3ij e\u2212 1 2T\u03b1 . (14)\nExtending this argument for pruning weights to a quantization setting, we design a post-training quantization scheme that assigns each weight the quantized value ck with the highest likelihood under the approximate posterior. 
Since variational posteriors over weights are Gaussian, this translates into minimizing the squared distance between the mean \u03b8ij and the quantized values ck:\narg max k N (ck|\u03b8ij , \u03c32ij) = arg max k e \u2212\n(ck\u2212\u03b8ij) 2\n2\u03c32 ij = arg min k (ck \u2212 \u03b8ij)2. (15)\nAdditionally, the pruning rate can be increased by first assigning a hard 0 to all weights that exceed the pruning threshold T\u03b1 (see Eq. (9)) before performing the assignment to quantization levels as described above. 3.3 KL DIVERGENCE APPROXIMATION\nUnder the quantizing prior (Eq. (12)) the KL divergence from the prior DKL(q\u03c6(w)||p(w)) to the mean-field posterior is analytically intractable. Similar to Kingma et al. (2015); Molchanov et al. (2017), we use a differentiable approximation FKL(\u03b8, \u03c3, c)3, composed of a small number of differentiable functions to keep the computational effort low during training. We now present the approximation for a reference codebook c = [\u2212r, 0, r], r = 0.2, however later we show how the approximation can be used for arbitrary ternary, symmetric codebooks as well. The basis of our approximation is the approximation FKL,LU introduced by Molchanov et al. (2017) for the KL divergence from a log uniform prior to a Gaussian posterior (see Eq. (7)) which is centered around zero. We observe that a weighted mixture of shifted versions of FKL,LU can be used to approximate the KL divergence for our multi-modal quantizing prior (Eq. (12)) (which is composed of shifted versions of the log uniform prior). In a nutshell, we shift one version of FKL to each codebook entry ck and then use \u03b8-dependent Gaussian windowing functions \u2126(\u03b8) to mix the shifted approximations (see more details in the Appendix A.3). 
The approximation for the KL divergence from our multi-modal quantizing prior to a Gaussian posterior is given as\nFKL(\u03b8, \u03c3, c) = \u2211_{k: ck \u2260 0} \u2126(\u03b8 \u2212 ck) FKL,LU(\u03b8 \u2212 ck, \u03c3) [local behavior] + \u21260(\u03b8) FKL,LU(\u03b8, \u03c3) [global behavior], (16)\nwith\n\u2126(\u03b8) = exp(\u2212\u03b8^2 / (2\u03c4^2)), \u21260(\u03b8) = 1 \u2212 \u2211_{k: ck \u2260 0} \u2126(\u03b8 \u2212 ck). (17)\nWe use \u03c4 = 0.075 in our experiments. Illustrations of the approximation, including a comparison against the ground-truth computed via Monte Carlo sampling, are shown in Fig. 2. Over the range of \u03b8- and \u03c3-values relevant to our method, the maximum absolute deviation from the ground-truth is 1.07 nats. See Fig. 4 in the Appendix for a more detailed quantitative evaluation of our approximation.\nThis KL approximation in Eq. (16), developed for the reference codebook cr = [\u2212r, 0, r], can be reused for any symmetric ternary codebook ca = [\u2212a, 0, a], a \u2208 R+, since ca can be represented with the reference codebook and a positive scaling factor s, ca = scr, s = a/r. As derived in the Appendix (A.4), this re-scaling translates into a multiplicative re-scaling of the variational parameters \u03b8 and \u03c3. The KL divergence from a prior based on the codebook ca to the posterior q\u03c6(w) is thus given by DKL(q\u03c6(w)||pca(w)) \u2248 FKL(\u03b8/s, \u03c3/s, cr). This result allows learning the quantization level a during training as well.\n4 EXPERIMENTS\nIn our experiments, we train with VNQ and then first prune via thresholding log\u03b1ij \u2265 log T\u03b1 = 2. Remaining weights are then quantized by minimizing the squared distance to the quantization values ck (see Sec. 3.2). We use warm-up (S\u00f8nderby et al., 2016), that is, we multiply the KL divergence term (Eq. 
(2)) with a factor \u03b2, where \u03b2 = 0 during the first few epochs and then linearly ramp up to \u03b2 = 1. To improve stability of VNQ training, we ensure through clipping that log \u03c32ij \u2208 (\u221210, 1) and \u03b8ij \u2208 (\u2212a \u2212 0.3679\u03c3, a + 0.3679\u03c3) (which corresponds to a shifted log\u03b1 threshold of 2, that is, we clip \u03b8ij if it lies left of the \u2212a funnel or right of the +a funnel, compare Fig. 1). This leads to a clipping-boundary that depends on trainable parameters. To avoid weights getting stuck at these boundaries, we use gradient-stopping, that is, we apply the gradient to a so-called \u201cshadow weight\u201d and use the clipped weight-value only for the forward pass. Without this procedure our method still works, but accuracies are a bit worse, particularly on CIFAR-10. When learning codebook values\n3To keep notation in this section simple, we drop the indices ij from w, \u03b8 and \u03c3 but we refer to individual weights and their posterior parameters throughout the section.\na during training, we use a lower learning rate for adjusting the codebook, otherwise we observe a tendency for codebook values to collapse in early stages of training (a similar observation was made by Ullrich et al. (2017)). Additionally, we ensure a \u2265 0.05 by clipping. 4.1 LENET-5 ON MNIST\nWe demonstrate our method with LeNet-54 (LeCun et al., 1998) on the MNIST handwritten digits dataset. Images are pre-processed by subtracting the mean and dividing by the standard-deviation over the training set. For the pre-trained network we run 5 epochs on a randomly initialized network (Glorot initialization, Adam optimizer), which leads to a validation accuracy of 99.2%. We initialize means \u03b8 with the pre-trained weights and variances with log \u03c32 = \u22128. The warm-up factor \u03b2 is linearly increased from 0 to 1 during the first 15 epochs. 
VNQ training runs for a total of 195 epochs with a batch-size of 128. The learning rate is linearly decreased from 0.001 to 0, and the learning rate for adjusting the codebook parameter a is 100 times lower. We initialize with a = 0.2. Results are shown in Table 1; a visualization of the distribution over weights after VNQ training is shown in Fig. 1.\nWe find that VNQ training sufficiently prepares a network for pruning and quantization with negligible loss in accuracy and without requiring subsequent fine-tuning. Training from scratch yields a similar performance compared to initializing with a pre-trained network, with a slightly higher pruning rate. Compared to pruning methods that do not consider few-bit quantization in their objective, we achieve significantly lower pruning rates. This is an interesting observation since our method is based on a similar objective (e.g., compared to Sparse VD) but with the addition of forcing non-pruned weights to tightly cluster around the quantization levels. Few-bit quantization severely limits network capacity. Perhaps this capacity limitation must be countered by pruning fewer weights. Our pruning rates are roughly in line with other papers on ternary quantization, e.g., Zhu et al. (2016), who report sparsity levels between 30% and 50% with their ternary quantization method.\n4The Caffe version, see https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt\nNote that a direct comparison between pruning, quantizing and ternarizing methods is difficult and depends on many factors such that a fair computation of the compression rate that does not implicitly favor certain methods is hardly possible within the scope of this paper. For instance, compression rates for pruning methods are typically reported under the assumption of a CSC storage format which would not fully account for the compression potential of a sparse ternary matrix. 
We thus choose not to report any measures for compression rates, however for the methods listed in Table 1, they can easily be found in the literature. 4.2 DENSENET ON CIFAR-10\nOur second experiment uses a modern DenseNet (Huang et al., 2017) (k = 12, depth L = 76, with bottlenecks) on CIFAR-10 (Krizhevsky & Hinton, 2009). We follow the CIFAR-10 settings of Huang et al. (2017)5. The training procedure is identical to the procedure on MNIST with the following exceptions: we use a batch-size of 64 samples, the warm-up weight \u03b2 of the KL term is 0 for the first 5 epochs and is then linearly ramped up from 0 to 1 over the next 15 epochs, the learning rate of 0.005 is kept constant for the first 50 epochs and then linearly decreased to a value of 0.003 when training stops after 150 epochs. We pre-train a deterministic DenseNet (reaching validation accuracy of 93.19%) to initialize VNQ training. The codebook parameter for non-zero values a is initialized with the maximum absolute value over pre-trained weights per layer. Results are shown in Table 2. A visualization of the distribution over weights after VNQ training is shown in the Appendix Fig. 3.\nWe generally observe lower levels of sparsity for DenseNet, compared to LeNet. This might be due to the fact that DenseNet already has an optimized architecture which removed a lot of redundant parameters from the start. In line with previous publications, we generally observed that the first and last layer of the network are most sensitive to pruning and quantization. However, in contrast to many other methods that do not quantize these layers (e.g., Zhu et al. (2016)), we find that after sufficient training, the complete network can be pruned and quantized with very little additional loss in accuracy (see Table 2). Inspecting the weight scatter-plot for the first and last layer (Appendix Fig. 
3, top-left and bottom-right panel) it can be seen that some weights did not settle on one of the\n5Our DenseNet(L = 76, k = 12) consists of an initial convolutional layer (3\u00d7 3 with 16 output channels), followed by three dense blocks (each with 12 pairs of 1 \u00d7 1 convolution bottleneck followed by a 3 \u00d7 3 convolution, number of channels depends on growth-rate k = 12) and a final classification layer (global average pooling that feeds into a dense layer with softmax activation). In-between the dense blocks (but not after the last dense block) are (pooling) transition layers (1 \u00d7 1 convolution followed by 2 \u00d7 2 average pooling with a stride of 2).\nprior modes (the \u201cfunnels\u201d) after VNQ training, particularly the first layer has a few such weights with very low variance. It is likely that quantizing these weights causes the additional loss in accuracy that we observe when quantizing the whole network. Without gradient stopping (i.e., applying gradients to a shadow weight at the trainable clipping boundary) we have observed that pruning and quantizing the first layer leads to a more pronounced drop in accuracy (about 3% compared to a network where the first layer is kept with full precision, not shown in results). 6 DISCUSSION\nA potential shortcoming of our method is the KL divergence approximation (Sec. 3.3). While the approximation is reasonably good on the relevant range of \u03b8- and \u03c3-values, there is still room for improvement which could have the benefit that weights are drawn even more tightly onto the quantization levels, resulting in lower accuracy loss after quantization and pruning. Since our functional\napproximation to the KL divergence only needs to be computed once and an arbitrary amount of ground-truth data can be produced, it should be possible to improve upon the approximation presented here at least by some brute-force function approximation, e.g., a neural network, polynomial or kernel regression. 
The main difficulty is that the resulting approximation must be differentiable and must not introduce significant computational overhead since the approximation is evaluated once for each network parameter in each gradient step. We have also experimented with a naive Monte-Carlo approximation of the KL divergence term. This has the disadvantage that local reparameterization (where pre-activations are sampled directly) can no longer be used, since weight samples are required for the MC approximation. To keep computational complexity comparable, we used a single sample for the MC approximation. In our LeNet-5 on MNIST experiment the MC approximation achieves comparable accuracy with higher pruning rates compared to our functional KL approximation. However, with DenseNet on CIFAR-10 and the MC approximation validation accuracy plunges catastrophically after pruning and quantization. See Sec. A.3 in the Appendix for more details. Compared to similar methods that only consider network pruning, our pruning rates are significantly lower. This does not seem to be a particular problem of our method since other papers on network ternarization report similar or even lower sparsity levels (Zhu et al. (2016) roughly achieve between 30% and 50% sparsity). The reason for this might be that heavily quantized networks have a much lower capacity compared to full-precision networks. This limited capacity might require that the network compensates by effectively using more weights such that the pruning rates become significantly lower. Similar trends have also been observed with binary networks, where drops in accuracy could be prevented by increasing the number of neurons (with binary weights) per layer. Principled experiments to test the trade-off between low bit-precision and sparsity rates would be an interesting direction for future work. 
One starting point could be to test our method with more quantization levels (e.g., 5, 7 or 9) and investigate how this affects the pruning rate.\nA APPENDIX\nA.1 VISUALIZATION OF DENSENET WEIGHTS AFTER VNQ TRAINING\nSee Fig. 3.\nA.2 LOCAL REPARAMETERIZATION\nWe follow Sparse VD (Molchanov et al., 2017) and use the Local Reparameterization Trick (Kingma et al., 2015) and Additive Noise Reparameterization to optimize the stochastic gradient variational lower bound LSGVB (Eq. (2)). We optimize posterior means and log-variances (\u03b8, log \u03c32) and the codebook level a. We apply Variational Network Quantization to fully connected and convolutional layers. Denoting inputs to a layer with A_{M\u00d7I}, outputs of a layer with B_{M\u00d7O} and using local reparameterization we get:\nbmj \u223c N(\u03b3mj, \u03b4mj); \u03b3mj = \u2211_{i=1}^{I} ami \u03b8ij, \u03b4mj = \u2211_{i=1}^{I} ami^2 \u03c3ij^2\nfor a fully connected layer. Similarly, activations for a convolutional layer are computed as follows:\nvec(bmk) \u223c N(\u03b3mk, \u03b4mk); \u03b3mk = vec(Am \u2217 \u03b8k), \u03b4mk = diag(vec(Am^2 \u2217 \u03c3k^2)),\nwhere (\u00b7)^2 denotes element-wise squaring, \u2217 is the convolution operation and vec(\u00b7) denotes reshaping of a matrix/tensor into a vector.\nA.3 KL APPROXIMATION FOR QUANTIZING PRIOR\nUnder the quantizing prior (Eq. (12)) the KL divergence from the prior to the mean-field posterior DKL(q\u03c6(wij)||p(wij)) is analytically intractable. Molchanov et al. (2017) presented an approximation for the KL divergence under a (zero-centered) log uniform prior (Eq. (5)). Since our quantizing prior is essentially a composition of shifted log uniform priors, we construct a composition of the approximation given by Molchanov et al. (2017), shown in Eq. (7). 
The original approximation can be utilized to calculate a KL divergence approximation (up to an additive constant C\u0303) from a shifted log-uniform prior p(wij) \u221d 1/|wij \u2212 r| to a Gaussian posterior q\u03c6(wij) by transferring the shift to the posterior parameter \u03b8:\nDKL(q{\u03b8ij, \u03c3ij}(wij) || p(wij) \u221d 1/|wij \u2212 r|) = DKL(q{\u03b8ij \u2212 r, \u03c3ij}(wij) || p(wij) \u221d 1/|wij|) + C\u0303. (18)\nFor small posterior variances \u03c32ij (\u03c3ij \u226a r) and means near the quantization levels (i.e., |\u03b8ij| \u2248 r), the KL divergence is dominated by the mixture prior component located at the respective quantization level r. For these values of \u03b8 and \u03c3, the KL divergence can be approximated by shifting the approximation FKL,LU(\u03b8, \u03c3) to the quantization level r, i.e., FKL,LU(\u03b8 \u00b1 r, \u03c3). For small \u03c3 and values of \u03b8 near zero or far away from any quantization level, as well as for large values of \u03c3 and arbitrary \u03b8, the KL divergence can be approximated by the original non-shifted approximation FKL,LU(\u03b8, \u03c3). Based on these observations we construct our KL approximation by properly mixing shifted versions of FKL,LU(\u03b8 \u00b1 r, \u03c3). We use Gaussian window functions \u2126(\u03b8 \u00b1 r) to perform this weighting (to ensure differentiability). The remaining \u03b8 domain is covered by an approximation located at zero and weighted such that this approximation is dominant near zero and far away from the quantization levels, which is achieved by introducing the constraint that all window functions sum up to one on the full \u03b8 domain. See Fig. 2 for a visual representation of shifted approximations and their respective window functions.\nA.3.1 APPROXIMATION QUALITY\nWe evaluate the quality of our KL approximation (Eq. (16)) by comparing against a ground-truth Monte Carlo approximation on a dense grid over the full range of relevant \u03b8 and \u03c3 values. 
Results of this comparison are shown in Fig. 4. As an alternative to the functional KL approximation, one could also use a naive Monte Carlo approximation directly. This has the disadvantage that local reparameterization can no longer be used, since actual samples of the weights must be drawn. To assess the quality of our functional KL approximation, we also compare against experiments where we use a naive MC approximation of the KL divergence term, where we only use a single sample for approximating the expectation to keep computational complexity comparable to our original method. Note that the \u201cground-truth\u201d MC approximation used before to evaluate KL approximation quality uses many more samples, which would be prohibitively expensive during training. To test for the effect of local reparameterization in isolation we also show results for our functional KL approximation without using local reparameterization. The results in Table 3 show that the naive MC approximation of the KL term leads to slightly lower validation error on MNIST (LeNet-5) (with higher pruning rates) but on CIFAR-10 (DenseNet) the validation error of the network trained with the naive MC approximation catastrophically increases after pruning and quantizing the network. Except for removing local reparameterization or plugging in the naive MC approximation, experiments were run as described in Sec. 4.\nInspecting the distribution over weights after training with the naive MC approximation for the KL divergence, shown in Fig. 5 for LeNet-5 and in Fig. 6 for DenseNet, reveals that weight-means tend to be more dispersed and weight-variances tend to be generally lower than when training with our functional KL approximation (compare Fig. 1 for LeNet-5 and Fig. 3 for DenseNet). 
We speculate that the combined effects of missing local reparameterization and single-sample MC approximation lead to more noisy gradients.\nA.4 REUSING THE KL APPROXIMATION FOR ARBITRARY CODEBOOKS\nWe show that the KL approximation (Eq. (16)), developed for a fixed reference codebook, can be reused for arbitrary codebooks as long as codebook learning is restricted to learning a multiplicative scaling factor. Without loss of generality we consider the case of ternary, symmetric codebooks6\ncr = [\u2212r, 0, r]; pcr(w) = \u2211_{k=1}^{3} ak / |w \u2212 cr,k|, (19)\nwhere r \u2208 R+ is the quantization level value and pcr denotes a sparsity-inducing, quantizing prior over weights (sparsity is induced because one of the codebook entries is fixed to 0). We denote cr as the reference codebook for which we design the KL approximation DKL(q\u03c6(w)||pcr) = FKL(\u03b8, \u03c3, cr) (Eq. (16)). This approximation can be reused for any symmetric ternary codebook ca = [\u2212a, 0, a] with quantization level a \u2208 R+. The latter can be seen by representing ca with the reference codebook and a positive scaling factor s > 0 as ca = scr, s = a/r. This re-scaling translates into a multiplicative re-scaling of the variational parameters \u03b8 and \u03c3. To see this, consider the prior pca, based on codebook ca:\npca(w) = (1/Z) \u2211_{k=1}^{3} ak / |w \u2212 ca,k| = (1/Z) \u2211_{k=1}^{3} ak / |w \u2212 s cr,k|. (20)\nThe KL divergence from a prior based on the codebook ca to the posterior q\u03c6(w) is given by\nDKL(q\u03c6(w)||pca(w)) = \u222b q\u03c6(w) log [ q\u03c6(w) / \u2211_{k=1}^{3} (ak / |w \u2212 ca,k|) ] dw + C\n= \u222b q\u03c6(w) log [ q\u03c6(w) / ((1/s) \u2211_{k=1}^{3} (ak / |w/s \u2212 cr,k|)) ] dw + C | subst. z = w/s, dw = s dz\n= \u222b q\u03c6(sz) log [ q\u03c6(sz) / ((1/s) \u2211_{k=1}^{3} (ak / |z \u2212 cr,k|)) ] s dz + C. 
(21)\nSince q\u03c6(sz) is Gaussian, the scaling s can be transferred into the variational parameters \u03c6 = (\u03b8, \u03c3):\nq\u03c6(sz) = N(sz; \u03b8, \u03c3^2) = (1/s) N(z; \u03b8/s, \u03c3^2/s^2) = (1/s) q\u03c6\u0302(z),\nwith \u03c6\u0302 = (\u03b8/s, \u03c3/s). Inserting into Eq. (21) yields:\nDKL(q\u03c6(w)||pca(w)) = \u222b (1/s) q\u03c6\u0302(z) log [ (1/s) q\u03c6\u0302(z) / ((1/s) \u2211_{k=1}^{3} (ak / |z \u2212 cr,k|)) ] s dz + C\n= \u222b q\u03c6\u0302(z) log [ q\u03c6\u0302(z) / \u2211_{k=1}^{3} (ak / |z \u2212 cr,k|) ] dz + C\n= DKL(q\u03c6\u0302(w)||pcr(w)) + C. (22)\nThus, DKL(q\u03c6(w)||pca(w)) = DKL(q\u03c6\u0302(w)||pcr(w)) + C \u2248 FKL(\u03b8/s, \u03c3/s, cr), where FKL is given by Eq. (16). This means that the KL approximation can be used for arbitrary ternary, symmetric codebooks of the form ca = [\u2212a, 0, a] = scr because the scaling factor s translates into a re-scaling of the variational parameters \u03c6\u0302 = (\u03b8/s, \u03c3/s).\n6Note that indices ij have been dropped for notational brevity from the whole section. However, throughout the section we refer to individual weights wij and their variational parameters \u03b8ij and \u03c3ij", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Deep compositional question answering with neural module , Authors: Jacob Andreas Marcus Rohrbach... , 2015\n2.  Title: Net2net: Accelerating learning via knowledge transfer , Authors: Tianqi Chen Ian Goodfellow... , 2015\n3.  Title: Pathnet: Evolution channels gradient descent in super neural networks , Authors: Chrisantha Fernando Dylan Banarse... , 2017\n4.  Title: Learning to Reason: End-to-End Module Networks for Visual Question Answering , Authors: R. Hu J. Andreas... , 2017\n5.  Title: Overcoming catastrophic forgetting , Authors: James Kirkpatrick Razvan Pascanu... , 2016\n6.  Title: ImageNet classification with deep convolutional neural networks , Authors: A Krizhevsky I Sutskever... , 2012\n7.  Title: Learning to reinforcement learn , Authors: Jane X. Wang Zeb Kurth-Nelson... , 2016\n8.  Title: Improved multitask learning through synaptic intelligence , Authors: Friedemann Zenke Ben Poole... , 2017\n9.  Title: There are a total of 59 unique classes in this dataset , Authors: Yamins , 2014\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["A core aspect of human intelligence is the ability to learn new tasks quickly and switch between them flexibly. Here, we describe a modular continual reinforcement learning paradigm inspired by these abilities. We first introduce a visual interaction environment that allows many types of tasks to be unified in a single framework. We then describe a reward map prediction scheme that learns new tasks robustly in the very large state and action spaces required by such an environment. We investigate how properties of module architecture influence efficiency of task learning, showing that a module motif incorporating specific design principles (e.g. 
early bottlenecks, low-order polynomial nonlinearities, and symmetry) significantly outperforms more standard neural network motifs, needing fewer training examples and fewer neurons to achieve high levels of performance. Finally, we present a meta-controller architecture for task switching based on a dynamic neural voting scheme, which allows new modules to use information learned from previously seen tasks to substantially improve their own learning efficiency.", "Modern deep convolutional neural networks have had significant impact on computer vision and artificial intelligence (Krizhevsky et al., 2012), as well as in the computational neuroscience of vision (Yamins & DiCarlo, 2016). There is a recent but growing literature on convnet-based neural modules, where they have been used for solving compositional visual reasoning tasks (Andreas et al., 2015; Hu et al., 2017). In this work we apply the idea of modules to solving visual learning challenges in a continual learning context. Existing works rely on choosing between a menu of pre-specified module primitives, using different module types to solve subproblems involving specific input-output datatypes, without addressing how these modules\u2019 forms are to be discovered in the first place. In this paper, we show a single generic module architecture is capable of automatically learning to solve a wide variety of different tasks in a unified action/state space, and a simple controller scheme is able to switch between such modules. Our results are also closely connected with the literature on lifelong (or continual) learning (Kirkpatrick et al., 2016; Rusu et al., 2016). A part of this literature is concerned with learning to solve new tasks without catastrophically forgetting how to solve old ones (Zenke et al., 2017; Kirkpatrick et al., 2016). The use of modules obviates this problem, but instead shifts the hard question to one of how newly-allocated modules can be learned effectively. 
The continual learning literature also directly addresses knowledge transfer to newly allocated structures (Chen et al., 2015; Rusu et al., 2016; Fernando et al., 2017), but largely addresses how transfer learning can lead to higher performance, rather than addressing how it can improve learning speed. ", "Reading this paper feels like reading at least two closely-related papers compressed into one, with overflow into the appendix (e.g. one about the EMS module, one about the recurrent voting, etc).\n\nThere were so many aspects/components, that I am not entirely confident I fully understood how they all work together, and in fact I am pretty confident there was at least some part of this that I definitely did not understand. Reading it 5-20 more times would most likely help.\n\n-The defs/eqns at the beginning of section 3.1 (Sc, CReLu, etc) were slightly hard to follow and I wonder whether there were any typos, e.g. was CReS meant to refer directly to Sc, but used the notation ${ReLu}^2$ instead? \n\nEach of these on its own would be easier to overlook, but there is a compounding effect here for me, as a reader, such that by further on in the paper, I am rather confused.\n\nI also wonder whether any of the elements described have more \"standard\" interpretations/notations. For example, my slight confusion propagated further: after the above point, I then did not have a clear intuition about $l_i$ in the EMS module. I get that symmetry has been built in, e.g. by the definitions of CReS and CReLu, etc, but I still don't see how it all works together, e.g. are late bottleneck architectures *exactly* the same as MLPs, but where inputs have simply been symmetrized, squared, etc? 
Nor do I have intuition about multiplicative symmetric interactions between visual features and actions, although I do get the sense that if I were to spend several hours implementing/writing out toy examples, it would clarify it significantly (in fact, I wouldn't be too surprised if it turns out to be fairly straightforward, as in my above comment indicating a seeming equivalence to simply multiplying two terms and taking the resulting sign). If the paper didn't need to be quite as dense, then I would suggest providing more elucidation for the reader, either with intuitions or examples or clearer relationships to more familiar formulations.\n\nLater, I did find that some of the info I *needed* in order to understand the results (e.g. exactly what is meant by a \"symmetry ablation\", how was that implemented?) was in fact in the appendices (of which there are over 8 pages).\n\nI do wonder how sensitive the performance of the overall system is to some of the details, like, e.g. the low-temp Boltzmann sampling rather than identity function, as described at the end of S2.\n\n8: Top 50% of accepted papers, clear accept"], "input": "INTRODUCTION\nIn the course of everyday functioning, people are constantly faced with real-world environments in which they are required to shift unpredictably between multiple, sometimes unfamiliar, tasks (Botvinick & Cohen, 2014). They are nonetheless able to flexibly adapt existing decision schemas or build new ones in response to these challenges (Arbib, 1992). How humans support such flexible learning and task switching is largely unknown, both neuroscientifically and algorithmically (Wagner et al., 1998; Cole et al., 2013).\nWe investigate solving this problem with a neural module approach in which simple, task-specialized decision modules are dynamically allocated on top of a largely-fixed underlying sensory system (Andreas et al., 2015; Hu et al., 2017). 
The sensory system computes a general-purpose visual representation from which the decision modules read. While this sensory backbone can be large, complex, and learned comparatively slowly with significant amounts of training data, the task modules that deploy information from the base representation must, in contrast, be lightweight, quick to learn, and easy to switch between. In the case of visually-driven tasks, results from neuroscience and computer vision suggest the role of the fixed general-purpose visual representation may be played by the ventral visual stream, modeled as a deep convolutional neural network (Yamins & DiCarlo, 2016; Razavian et al., 2014). However, the algorithmic basis for how to efficiently learn and dynamically deploy visual decision modules remains far from obvious.\nIn standard supervised learning, it is often assumed that the output space of a problem is prespecified in a manner that just happens to fit the task at hand \u2013 e.g. for a classification task, a discrete output with a fixed number of classes might be determined ahead of time, while for a continuous estimation problem, a one-dimensional real-valued target might be chosen instead. This is a very convenient simplification in supervised learning or single-task reinforcement learning contexts, but if one is interested in the learning and deployment of decision structures in a rich environment defining tasks with many different natural output types, this simplification becomes cumbersome.\nTo go beyond this limitation, we build a unified environment in which many different tasks are naturally embodied. Specifically, we model an agent interacting with a two-dimensional touchscreen-like GUI that we call the TouchStream, in which all tasks (discrete categorization tasks, continuous estimation problems, and many other combinations and variants thereof) can be encoded using a single common and intuitive \u2013 albeit large \u2013 output space. 
This choice frees us from having to hand-design or programmatically choose between different output domain spaces, but forces us to confront the core challenge of how a naive agent can quickly and emergently learn the implicit \u201cinterfaces\u201d required to solve different tasks.\nWe then introduce Reward Map Prediction (ReMaP) networks, an algorithm for continual reinforcement learning that is able to discover implicit task-specific interfaces in large action spaces like those of the TouchStream environment. We address two major algorithmic challenges associated with learning ReMaP modules. First, what module architectural motifs allow for efficient task interface learning? We compare several candidate architectures and show that those incorporating certain intuitive design principles (e.g. early visual bottlenecks, low-order polynomial nonlinearities and symmetry-inducing concatenations) significantly outperform more standard neural network motifs, needing fewer training examples and fewer neurons to achieve high levels of performance. Second, what system architectures are effective for switching between tasks? We present a meta-controller architecture based on a dynamic neural voting scheme, allowing new modules to use information learned from previously-seen tasks to substantially improve their own learning efficiency.\nIn \u00a7 1 we formalize the TouchStream environment. In \u00a7 2, we introduce the ReMaP algorithm. In \u00a7 3, we describe and evaluate comparative performance of multiple ReMaP module architectures on a variety of TouchStream tasks. In \u00a7 4, we describe the Dynamic Neural Voting meta-controller, and evaluate its ability to efficiently transfer knowledge between ReMaP modules on task switches. 
1 THE TOUCHSTREAM ENVIRONMENT\nAgents in a real-world environment are exposed to many different implicit tasks, arising without predefined decision structures, and must learn on the fly what the appropriate decision interfaces are\nfor each situation. Because we are interested in modeling how agents can do this on-the-fly learning, our task environment should mimic the unconstrained nature of the real world. Here, we describe the TouchStream environment, which attempts to do this in a simplified two-dimensional domain.\nOur problem setup consists of two components, an \u201cenvironment\u201d and an \u201cagent,\u201d interacting over an extended temporal sequence (Fig. 1). At each timestep t, the environment emits an RGB image xt of height H and width W , and a scalar reward rt. Conversely, the agent accepts images and rewards as input and chooses an action at in response. The action space A available to the agent consists of a two-dimensional pixel grid {0, . . . ,H \u2212 1}\u00d7{0, . . . ,W \u2212 1} \u2282 Z2, of the same height and width as its input image. The environment is equipped with a policy (unknown to the agent) that on each time step computes image xt and reward rt as a function of the history of agent actions {a0, . . . , at\u22121}, images {x0, . . . , xt\u22121} and rewards {r0, . . . , rt\u22121}. In this work, the agent is a neural network, composed of a visual backbone with fixed weights, together with a meta-controller module whose parameters are learned by interaction with the environment. The agent\u2019s goal is to learn to enact a policy that maximizes its reward obtained over time. Unlike an episodic reinforcement learning context, the TouchStream environment is continuous: throughout the course of learning the agent is never signaled when it should reset to some \u201cinitial\u201d internal state. However, unlike the traditional continuous learning context of e.g. 
Sutton & Barto (1998), a TouchStream may implicitly define many different tasks, each of which is associated with its own characteristic reward schedule. The agent experiences a continual stream of tasks, and any implicit association between reward schedule and state reset must be discovered by the agent.\nBy framing the action space A of the agent as all possible pixel locations and the state space as any arbitrary image, a very wide range of possible tasks are unified in this single framework, at the cost of requiring the agent\u2019s action space to be congruent to its input state space, and thus be quite large. This presents two core efficiency challenges for the agent: on any given task, it must be able to both quickly recognize what the \u201cinterface\u201d for the task is, and transfer such knowledge across tasks in a smart way. Both of these goals are complicated by the large size of the agent\u2019s state and action spaces.\nAlthough we work with modern large-scale computer vision-style datasets and tasks in this work, e.g. ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014), we are also inspired by visual psychology and neuroscience, which have pioneered techniques for how controlled visual tasks can be embodied in real reinforcement learning paradigms (Horner et al., 2013; Rajalingham et al., 2015). Especially useful are three classes of task paradigms that span a range of the ways discrete and continuous estimation tasks can be formulated \u2013 including Stimulus-Response, Match-To-Sample, and Localization tasks (Fig. 2).\nStimulus-Response Tasks: The Stimulus-Response (SR) paradigm is a common approach to physically embodying discrete categorization tasks (Gaffan & Harrison, 1988). For example, in the simple two-way SR discrimination task shown in Fig. 2a, the agent is rewarded if it touches the left half of the screen after being shown an image of a dog, and the right half after being shown a butterfly. 
SR tasks can be made more difficult by increasing the number of image classes or the complexity of the reward boundary regions. In our SR experiments, we use images and classes from the ImageNet dataset (Deng et al., 2009).\nMatch-To-Sample Tasks: The Match-to-Sample (MTS) paradigm is another common approach to assessing visual categorization abilities (Murray & Mishkin, 1998). In the MTS task shown in Fig. 2b, trials consist of a sequence of two image frames \u2013 the \u201csample\u201d screen followed by the \u201cmatch\u201d screen \u2013 in which the agent is expected to remember the object category seen on the sample frame, and then select an onscreen \u201cbutton\u201d (really, a patch of pixels) on the match screen corresponding to the sample screen category. Unlike SR tasks, MTS tasks require some working memory and more localized spatial control. More complex MTS tasks involve more sophisticated relationships between the sample and match screen. In Fig. 2c, using the MS-COCO object detection challenge dataset (Lin et al., 2014), the sample screen shows an isolated template image indicating one of the 80 MS-COCO classes, while the match screen shows a randomly-drawn scene from the dataset containing at least one instance of the sample-image class. The agent is rewarded if its chosen action is located inside the boundary of an instance (e.g. the agent \u201cpokes inside\u201d) of the correct class. This MS-COCO MTS task is a \u201chybrid\u201d of categorical and continuous elements, meaning that if phrased as a standard supervised learning problem, both a categorical readout (i.e. class identity) and a continuous readout (i.e. object location) would be required.\nLocalization: Fig. 
2d shows a two-step continuous localization task in which the agent is supposed to mark out the bounding box of an object by touching opposite corners on two successive timesteps, with reward proportionate to the Intersection over Union (IoU) value of the predicted bounding box relative to the ground truth bounding box: IoU = Area(B_GT \u2229 B\u0302) / Area(B_GT \u222a B\u0302). In localization, unlike the SR and MTS paradigms, the choice made at one timestep constrains the agent\u2019s optimal choice on a future timestep (e.g. picking the upper left corner of the bounding box on the first step constrains the opposite, lower right corner to be chosen on the second).\nAlthough these tasks can become arbitrarily complex along certain axes, the tasks presented here require only fixed-length memory and future prediction. That is, each task requires only knowledge of the past kb timesteps, and a perfect solution always exists within kf timesteps from any point. The minimal required values of kb and kf are different across the various tasks in this work. However, in the investigations below, we set these to the maximum required values across tasks, i.e. kb = 1 and kf = 2. Thus, the agent is required to learn for itself when it is safe to ignore information from the past and when it is irrelevant to predict past a certain point in the future.\nWe will begin by considering a restricted case where the environment runs one semantic task indefinitely, showing how different architectures learn to solve such individual tasks with dramatically different levels of efficiency (\u00a7 2-3). We will then expand to considering the case where the environment\u2019s policy consists of a sequence of tasks with unpredictable transitions between tasks, and exhibit a meta-controller that can cope effectively with this expanded domain (\u00a7 4). 2 REWARD MAP PREDICTION\nThe TouchStream environment necessarily involves working with large action and state spaces. 
Methods for handling this situation often focus on reducing the effective size of action/state spaces, either via estimating pseudo-counts of state-action pairs, or by clustering actions (Ostrovski et al., 2017; Dulac-Arnold et al., 2015). Here we take another approach, using a neural network to directly approximate the (image-state modulated) mapping between the action space and reward space, allowing learnable regularities in the state-action interaction to implicitly reduce the large spaces into something manageable by simple choice policies. We introduce an off-policy algorithm for efficient multitask reinforcement learning in large action and state spaces: Reward Map Prediction, or ReMaP. 2.1 REMAP NETWORK ALGORITHM\nAs with any standard reinforcement learning situation, the agent seeks to learn an optimal policy \u03c0 = p(at | xt) defining the probability density p over actions given image state xt. The ReMaP algorithm is off-policy, in that \u03c0 is calculated as a simple fixed function of the estimated reward.\nA ReMaP network M\u0398 is a neural network with parameters \u0398, whose inputs are a history over previous timesteps of (i) the agent\u2019s own actions, and (ii) an activation encoding of the agent\u2019s state space; and which explicitly approximates the expected reward map across its action space for some number of future timesteps. Mathematically:\nM\u0398 : [\u03a8t\u2212kb:t, ht\u2212kb:t\u22121] \u27fc [m^1_t, m^2_t, . . . , m^{kf}_t]\nwhere kb is the number of previous timesteps considered; kf is the length of future horizon to be considered; \u03a8t\u2212kb:t is the history [\u03c8(xt\u2212kb), . . . , \u03c8(xt)] of state space encodings produced by fixed backbone network \u03c8(\u00b7), ht\u2212kb:t\u22121 is the history [at\u2212kb, . . . , at\u22121] of previously chosen actions, and each m^j_t \u2208 map(A, R) \u2013 that is, a map from action space to reward space. 
The predicted reward maps are constructed by computing the expected reward obtained for a subsample of actions drawn randomly from A:\nm^j_t : at \u21a6 E[rt+j | at, ht\u2212kb:t\u22121, \u03a8t\u2212kb:t] = \u222b_R rt+j p(rt+j | at, ht\u2212kb:t\u22121, \u03a8t\u2212kb:t), (1)\nwhere rt+j is the predicted reward j steps into the future horizon. Having produced kf reward prediction maps, one for each timestep of its future horizon, the agent needs to determine what it believes will be the single best action over all the expected reward maps [m^1_t, m^2_t, . . . , m^{kf}_t].\nThe ReMaP algorithm formulates doing so by normalizing the predictions across each of these kf maps into separate probability distributions, and sampling an action from the distribution which has maximum variance. That is, the agent computes its policy \u03c0 as follows:\n\u03c0 = VarArgmax_{j=1,...,kf} {Dist[Norm[m^j_t]]}, (2)\nwhere\nNorm[m] = m \u2212 min_{x\u2208A} m(x) (3)\nis a normalization that removes the minimum of the map,\nDist[m] = f(m) / \u222b_A f(m(x)) (4)\nensures it is a probability distribution parameterized by functional family f(\u00b7), and VarArgmax is an operator which chooses the input with largest variance.\nThe sampling procedure described in equation (2) uses two complementary ideas to exploit spatial and temporal structure to efficiently explore a large action space. Since rewards in real physical tasks are spatially correlated, the distribution-based sampler in Equation (4) allows for more effective exploration of potentially informative actions than would the single-point estimate of an apparent optimum (e.g. an \u03b5-greedy policy). Further, in order to reduce uncertainty, the ReMaP algorithm explores timesteps with greatest reward map variance. 
The VarArgmax function nonlinearly upweights the timeframe with highest variance to exploit the fact that some points in time carry disproportionate relevance for reward outcome, somewhat analogously to how max-pooling operates in convolutional networks. Although any standard action selection strategy can be used in place of the one in (2) (e.g. pseudo \u03b5-greedy over all kf maps), we have empirically found that this policy is effective at efficiently exploring our large action space.\nThe parameters \u0398 of a ReMaP network are learned by gradient descent on the loss of the reward prediction error \u0398\u2217 = argmin_\u0398 L[mt(at), rt; \u0398], with map m^j_t compared to the true reward rt+j. Only the reward prediction in mt corresponding to the action chosen at timestep t participates in loss calculation and backpropagation of error signals. A minibatch of maps, rewards, and actions is collected over several consecutive inference passes before performing a parameter update.\nThe ReMaP algorithm is summarized in Algorithm 1.\nAlgorithm 1: ReMaP \u2013 Reward Map Prediction\nInitialize ReMaP network M\nInitialize state and action memory buffers \u03a8t\u2212kb:t and ht\u2212kb:t\u22121\nfor timestep t = 1, . . . , T do\n  Observe xt, encode with state space network \u03c8(\u00b7), and append to state buffer\n  Subsample set of potential action choices at uniformly from A\n  Produce kf expected reward maps of at from eq. (1)\n  Select action according to policy \u03c0 as in (2)\n  Execute action at in environment, store in action buffer, and receive reward rt\n  Calculate loss for this and previous kf \u2212 1 timesteps\n  if t \u2261 0 mod batch size then\n    Perform parameter update\nThroughout this work, we take our fixed backbone state space encoder to be the VGG-16 convnet, pretrained on ImageNet (Simonyan & Zisserman, 2014). Because the resolution of the input to this network is 224x224 pixels, our action space is A = {0, . . . , 223} \u00d7 {0, . . . , 223}. 
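The action-selection rule of Eqs. (2)-(4) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the reward maps are taken to be dense arrays over the whole action grid rather than a random subsample of actions, the variance used by VarArgmax is assumed to be the variance of the probability values across actions, and a flat map is assumed to fall back to a uniform distribution.

```python
import numpy as np

def norm_map(m):
    # Eq. (3): subtract the minimum of the map.
    return m - m.min()

def dist_map(m, f=lambda v: v):
    # Eq. (4): rescale f(m) so that it sums to one over the action grid.
    w = f(m)
    total = w.sum()
    if total <= 0:
        # Assumption: a flat map carries no preference, so use a uniform distribution.
        return np.full(m.shape, 1.0 / m.size)
    return w / total

def select_action(reward_maps, rng, f=lambda v: v):
    # reward_maps has shape (kf, H, W): one predicted map per future timestep.
    dists = [dist_map(norm_map(m), f) for m in reward_maps]
    # VarArgmax (Eq. 2): keep the distribution with the largest variance.
    j = int(np.argmax([d.var() for d in dists]))
    flat = dists[j].ravel()
    # Sample one pixel location from the chosen distribution.
    idx = rng.choice(flat.size, p=flat)
    row, col = np.unravel_index(idx, dists[j].shape)
    return j, (int(row), int(col))
```

For instance, calling select_action on two 4x4 maps where only the second has a single rewarded pixel selects the second map and returns that pixel, since the first map is flat and has zero variance.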
By default, the functional family f used in the action selection scheme in Eq. (4) is the identity, although on tasks benefiting from high action precision (e.g. Localization or MS-COCO MTS), it is often optimal to sample a low-temperature Boltzmann distribution with f(x) = e^{\u2212x/T}. Reward prediction errors are calculated using the cross-entropy loss (where logits are smooth approximations to the Heaviside function in analogy to eq. (5)). 3 EFFICIENT NEURAL MODULES FOR TASK LEARNING\nThe main question we seek to address in this section is: what specific neural network structure(s) should be used in ReMaP modules? The key considerations are that such modules (i) should be easy to learn, requiring comparatively few training examples to discover optimal parameters \u0398\u2217, and (ii) should be easy to learn from, meaning that an agent can quickly build a new module by reusing components of old ones.\nIntuitive Example: As an intuition-building example, consider the case of a simple binary Stimulus-Response task, as in Fig. 2a (\u201cif you see a dog touch on the right, if a butterfly touch on the left\u201d). One decision module that is a \u201cperfect\u201d reward predictor on this task is expressed analytically as:\nM[\u03a8t](ax, ay) = H(ReLU(W\u03a8t) \u00b7 ReLU(ax) + ReLU(\u2212W\u03a8t) \u00b7 ReLU(\u2212ax)) (5)\nwhere H is the Heaviside function, ax and ay are the x and y components of the action a \u2208 A relative to the center of the screen, and W is a 1 \u00d7 |\u03a8t| matrix expressing the class boundary (bias term omitted for clarity). If W\u03a8t is positive (i.e. the image is of a dog) then ax must also be positive (i.e. touch is on the right) to predict positive reward; conversely, if W\u03a8t is negative (i.e. butterfly), ax must be negative (i.e. left touch) to predict reward. If neither of these conditions holds, both terms are equal to zero, so the formula predicts no reward. 
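As a small worked illustration of Eq. (5), the predictor can be written out in NumPy. This is a sketch only: the function names are hypothetical, the class score W applied to the visual features is passed in directly as a scalar w_psi instead of being computed from VGG features, and the Heaviside function is assumed to return 0 at 0 so that a zero argument predicts no reward.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def heaviside(v):
    # Assumption: H(0) = 0, so a zero argument predicts no reward.
    return np.asarray(v > 0, dtype=float)

def perfect_sr_module(w_psi, ax):
    # w_psi: scalar class score (positive for dog, negative for butterfly).
    # ax: horizontal action coordinate relative to the screen centre.
    return heaviside(relu(w_psi) * relu(ax) + relu(-w_psi) * relu(-ax))
```

With a positive score, only touches with positive ax yield a predicted reward of 1; with a negative score, only touches with negative ax do.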
Since vertical location of the action does not affect reward, ay is not involved in reward calculation on this task.\nEquation (5) has three basic ideas embedded in its structure:\n\u2022 there is an early visual bottleneck, in which the high-dimensional general purpose feature representation \u03a8t is greatly reduced in dimension (in this case, from the 4096 features of VGG\u2019s FC6 layer, to 1) prior to combination with action space,\n\u2022 there is a multiplicative interaction between the action vector and (bottlenecked) visual features, and\n\u2022 there is symmetry, e.g. the first term of the formula is the sign-antisymmetric partner of the second term, reflecting something about the spatial structure of the task.\nIn the next sections, we show these three principles can be generalized into a parameterized family of networks from which the visual bottleneck (the W parameters), and decision structure (the form of equation (5)) can emerge naturally and efficiently via learning for any given task of interest. 3.1 THE EMS MODULE\nIn this section we define a generic ReMaP module which is lightweight, encodes all three generic design principles from the \u201cperfect\u201d formula, and uses only a small number of learnable parameters.\nDefine the concatenated square nonlinearity as\nSq : x \u27fc x \u2295 x^2\nand the concatenated ReLU nonlinearity (Shang et al., 2016) as\nCReLU : x \u27fc ReLU(x) \u2295 ReLU(\u2212x)\nwhere \u2295 denotes vector concatenation. The CReS nonlinearity is then defined as the composition of CReLU and Sq, i.e.\nCReS : x \u27fc ReLU(x) \u2295 ReLU(\u2212x) \u2295 ReLU^2(x) \u2295 ReLU^2(\u2212x).\nThe CReS nonlinearity introduces multiplicative interactions between its arguments via its Sq component and symmetry via its use of CReLU. Definition. The (n0, n1, . . . 
, nk)-Early Bottleneck-Multiplicative-Symmetric (EMS) module is the ReMaP module given by\nB = CReLU(W0\u03a8 + b0)\nl1 = CReS(W1(B \u2295 a) + b1)\nli = CReS(Wi li\u22121 + bi) for i > 1\nwhere Wi and bi are learnable parameters, \u03a8 are features from the fixed visual encoding network, and a is the action vector in A.\nThe EMS structure builds in each of the three principles described above. The B stage represents the early bottleneck in which visual encoding inputs are bottlenecked to size n0 before being combined with actions. The module then performs k CReS stages, introducing multiplicative symmetric interactions between visual features and actions. From this, the \u201cperfect\u201d module definition for the binary SR task in eq. (5) then becomes a special case of a two-layer EMS module. Note that the visual features to be bottlenecked can be from any encoder; in practice, we work with both fully connected and convolutional features of the VGG-16 backbone.\nIn the experiments that follow, we compare the EMS module to a wide variety of alternative control motifs, in which the early bottleneck, multiplicative, and symmetric features are ablated. Multiplicative nonlinearity and bottleneck ablations use a spectrum of more standard activation functions, including ReLU, tanh, sigmoid, elu (Clevert et al., 2015), and CReLU forms. In late bottleneck (fully-ablated) architectures \u2013 which are, effectively, \u201cstandard\u201d multi-layer perceptrons (MLPs) \u2013 action vectors are concatenated directly to the output of the visual encoder before being passed through subsequent stages. In all, we test 24 distinct architectures. Detailed information on each can be found in the Supplement. 3.2 EXPERIMENTS\nWe compared each architecture across 12 variants of visual SR, MTS, and localization tasks, using fixed visual encoding features from layer FC6 of VGG-16. Task variants ranged in complexity from simple (e.g. 
a binary SR task with ImageNet categories) to more challenging (e.g. a many-way ImageNet MTS task with result buttons appearing in varying positions on each trial). The most complex tasks are two variants of localization, either with a single main salient object placed on a complex background (similar to images used in Yamins & DiCarlo (2016)), or complex scenes from MS-COCO (see Fig. 3b). Details of the tasks used in these experiments can be found in the Supplement. Module weights were initialized using a normal distribution with \u00b5 = 0.0, \u03c3 = 0.01, and optimized using the ADAM algorithm (Kingma & Ba, 2014) with parameters \u03b21 = 0.9, \u03b22 = 0.999 and \u03b5 = 1e\u22128. Learning rates were optimized on a per-task, per-architecture basis in a cross-validated fashion. For each architecture and task, we ran optimizations from five different initialization seeds to obtain mean and standard error due to initial condition variability. For fully-ablated \u201clate-bottleneck\u201d modules, we measured the performance of modules of three different sizes (small, medium, and large), where the smallest version is equivalent in size to the EMS module, and the medium and large versions are much larger (Table S1).\nEmergence of Decision Structures: A key feature of ReMaP modules is that they are able to discover de novo the underlying output domain spaces for a variety of qualitatively distinct tasks (Fig. 3; more examples in Fig. S2). The emergent decision structures are highly interpretable and reflect the true interfaces that the environment implicitly defines. The spatiotemporal patterns of learning are robust across tasks and replicable across initial seedings, and thus might serve as a candidate model of interface use and learning in humans. 
In general, we observe that the modules typically discover the underlying \u201cphysical structures\u201d needed to operate the task interface before learning the specific decision rules needed to solve the task.\nFor example, in the case of a discrete MTS categorization task (Fig. 3a), this involves the quick discovery of onscreen \u201cbuttons\u201d corresponding to discrete action choices before these buttons are mapped to their semantic meaning. In the case of the MS-COCO MTS task (Fig. 3b), we observe the initial discovery of high salience object boundaries, followed by category-specific refinement. It is important to note that the visual backbone was trained on a categorization task, quite distinct from the localization task in MS-COCO MTS. Thus, the module had to learn this very different decision structure, as well as the class boundaries of MS-COCO, from scratch during training.\nEfficiency of the EMS module: The efficiency of learning was measured by computing the task-averaged, normalized area under the learning curve (TA-N-AUC) for each of the 24 modules tested, across all 12 task variants. Figs. 4a-d show characteristic learning curves for several tasks, summarized in the table in Fig. 4e. Results for all architectures for all tasks are shown in Supplement Figure S1. We find that the EMS module is the most efficient across tasks (0.997 TA-N-AUC). Moreover, the EMS architecture always achieves the highest final reward level on each task.\nIncreasing ablations of the EMS structure lead to increasingly poor performance, both in terms of learning efficiency and final performance. Ablating the low-order polynomial interaction (replacing Sq with CReLU) had the largest negative effect on performance (0.818 TA-N-AUC), followed in importance by the symmetric structure (0.944 TA-N-AUC). 
Large fully-ablated models (no bottleneck, using only ReLU activations) performed significantly worse than the smaller EMS module and the single ablations (0.717 TA-N-AUC), but better than the module with neither symmetry nor multiplicative interactions (0.566 TA-N-AUC). Small fully-ablated modules with the same number of parameters as EMS were by far the least efficient (0.403 TA-N-AUC) and oftentimes achieved much lower final reward. In summary, the main conceptual features by which the special-case architecture in eq. (5) solves the binary SR task are individually helpful, combine usefully, and can be parameterized and efficiently learned for a variety of visual tasks. These properties are critical to achieving effective task learning compared to standard MLP structures.\nIn a second experiment focusing on localization tasks, we tested an EMS module using convolutional features from the fixed VGG-16 feature encoder, reasoning that localization tasks could benefit from finer spatial feature resolution. We find that using visual features with explicit spatial information substantially improves task performance and learning efficiency on these tasks (Fig. 5). To our knowledge, our results on MS-COCO are the first demonstrated use of reinforcement learning to achieve instance-level object segmentations. Reward curves (measuring bounding box IoU) in Fig. 5a show little difference between any of the late bottleneck modules at any size. The only models to consistently achieve an IoU above 0.4 are the EMS-like variants, especially with convolutional features. For context, a baseline SVR trained using supervised methods to directly regress bounding boxes using the same VGG features results in an IoU of 0.369. 4 DYNAMIC NEURAL VOTING FOR TASK SWITCHING\nSo far, we\u2019ve considered the case where the TouchStream consists of only one task. 
However, agents in real environments are often faced with having to switch between tasks, many of which they may be encountering for the first time. Ideally, such agents would repurpose knowledge from previously learned tasks when it is relevant to a new task.\nFormally, we now consider environment policies consisting of sequences of tasks T = {\u03c41, \u03c42, ..., \u03c4\u2126}, each of which may last for an indeterminate period of time. Consider also a set of modules M, where each module corresponds to a task-specific policy \u03c0\u03c9(a | x) = p(at | xt, \u03c4\u03c9). When a new task begins, we cue the agent to allocate a new module M\u2126+1 which is added to the set of modules M. In the learning that follows allocation, the weights in old modules are held fixed while the parameters in the new module M\u2126+1 are trained. However, the output of the system is not merely the output of the new module, but instead is a dynamically allocated mixture of pathways through the computation graphs of the old and new modules. This mixture is determined by a meta-controller (Fig. 6). The meta-controller is itself a neural network which learns a dynamic distribution over (parts of) modules to be used in building the composite execution graph. Intuitively, this composite graph is composed of a small number of relevant pathways that mix and match parts of existing modules to solve the new task, potentially in combination with new module components that need to be learned. 4.1 DYNAMIC NEURAL VOTING\nWe define a meta-controller that assigns weights to each layer in each module in M. Let p^i_\u03c9 be the weight associated with the ith layer in module \u03c9. These weights are probabilistic on a per layer basis, i.e. p^i_\u03c9 \u2265 0 and \u2211_\u03c9 p^i_\u03c9 = 1, and can be interpreted as the probability of the controller selecting the ith layer l^i_\u03c9 for use in the execution graph, with distribution \u03c0^i = {p^i_\u03c9}. 
For such an assignment of weights, the composite execution graph defined by the meta-controller is generated by computing the sum of the activations of all the components at layer i weighted by the probabilities p^i_\u03c9. These values are then passed on to the next layer where this process repeats. Mathematically, the composite layer at stage i can be expressed as\nl\u0303^i_M = \u2211_\u03c9 p^i_\u03c9 M^i_\u03c9(l\u0303^{i\u22121}_M) = E_{\u03c0^i}[M^i_\u03c9(l\u0303^{i\u22121}_M)], (6)\nwhere M^i_\u03c9(\u00b7) is the operator that computes the ith layer of module \u03c9, and l\u0303^0 := \u03c8(xt) is the original encoded input state.\nThe question now is, where do these probabilistic weights come from? The core of our procedure is a dynamic neural voting process in which the controller network learns a Boltzmann distribution over module activations to maximize reward prediction accuracy. This process is performed at each module layer, where the module weightings for a given layer are conditioned on the results of voting at the previous layer. That is,\np^i = softmax[W^i (\u2295_\u03c9 M^i_\u03c9(l\u0303^{i\u22121}_M)) + b^i] (7)\nwhere p^i = (p^i_0, p^i_1, ..., p^i_\u2126) are the module weights at layer i, \u2295 is concatenation, and W^i \u2208 R^{(\u2126\u00b7L)\u00d7\u2126} is a learnable weight matrix of the controller.\nThis voting procedure operates in an online fashion, such that the controller is continuously learning its meta-policy while the agent is taking actions. As defined, the meta-controller constitutes a fully-differentiable neural network and is learned by gradient descent online.\nA useful refinement of the above mechanism involves voting across the units of M. Specifically, the meta-controller now assigns probabilistic weights p^{i,j}_\u03c9 to neuron n^{i,j}_\u03c9 (the jth unit in layer i of module \u03c9). 
In contrast to the layer-voting scheme, the dynamically generated execution graph computed by the meta-controller now becomes composite neurons with activations:\nn\u0303^{i,j}_M = \u2211_\u03c9 p^{i,j}_\u03c9 M^{i,j}_\u03c9(l\u0303^{i\u22121}_M) = E_{\u03c0^{i,j}}[M^{i,j}_\u03c9(l\u0303^{i\u22121}_M)]. (8)\nwhich are concatenated to form the composite layer l\u0303^i_M. The generalization of equation (7) to the single-unit voting scheme then becomes:\np^{i,j} = softmax[W^{i,j} (\u2295_\u03c9 M^{i,j}_\u03c9(l\u0303^{i\u22121}_M)) + b^{i,j}] (9)\nwhere p^{i,j} = (p^{i,j}_0, p^{i,j}_1, ..., p^{i,j}_\u2126) are the unit-level weights across modules, and W^{i,j} \u2208 R^{\u2126\u00d7\u2126}.\nEmpirically, we find that the initialization schemes of the learnable controller parameters are an important consideration in the design, and that two specialized transformations also contribute slightly to its overall efficiency. For details on these, please refer to the Supplement.\nThe dynamic neural voting mechanism achieves meta-control through a neural network optimized online via gradient descent while the modules are solving tasks, rather than a genetic algorithm that operates over a longer timescale as in the work of Fernando et al. (2017). Moreover, in contrast to the work of Rusu et al. (2016) the voting mechanism eliminates the need for fully-connected adaptation layers between modules, thus substantially reducing the number of parameters required for transfer.\n4.2 SWITCHING EXPERIMENTS\n[Figure 7 plots. Top panel: Reward; bottom panel: Reuse Fraction; x-axis: Batch Updates; legend: From Scratch, After Switch]\nFigure 7: Dynamic Neural Voting quickly corrects for \u201cno-switch\u201d switches. Although a new module is allocated for each task transition, if the new task is identical to the original task, the controller quickly learns to reuse the old module components. Top: post-switching learning curve for the EMS module on a binary stimulus-response task, after being trained on the same task. 
For clarity, only the Layer Voting method is compared against a baseline module trained from scratch. Bottom: fraction of the original module reused over the course of post-switch learning, calculated by averaging the voting weights of each layer in the original module.\n\u201cNo-switch\u201d switches: Our first experiments tested how the dynamic neural voting mechanism would respond to \u201cno-switch\u201d switches, i.e. ones in which, although a switch cue was given and a new module allocated, the environment policy\u2019s task did not actually change (Fig. 7). We find that in such cases, performance almost instantly approaches pre-switch levels (i.e. there is very little penalty in attempting an unnecessary switch). Moreover, we find that the weight the controller applies to the new module is low: in other words, the system recognizes that no new module is needed and acts accordingly by concentrating its weights on the existing module. These results show that, while we formally assume that the agent is cued as to when task switches occur, in theory it could implement a completely autonomous monitoring policy, in which the agent simply runs the allocation procedure if a performance \u201canomaly\u201d occurs (e.g. a sustained drop in reward). If the system determines that the new module was unneeded, it could simply reallocate the new module for a later task switch. In future work, we plan to implement this policy explicitly.\n\u201cReal\u201d switches: We next tested how the dynamic voting controller handled switches in which the environment policy substantially changed after the switching cue. Using both the EMS module and (for control) the large fully-ablated module as described in \u00a7 3.2, the dynamic neural voting controller was evaluated on 15 switching experiments using multiple variants of SR and MTS tasks. 
Specifically, these 15 switches cover a variety of distinct (but not mutually exclusive) switching types including:\n\u2022 addition of new classes to the dataset (switch indexes 2, 7, 11 in the table of Fig. 8)\n\u2022 replacing the current class set entirely with a new non-overlapping class set (switch ids. 1, 3)\n\u2022 addition of visual variability to a previously less variable task (switch id. 6)\n\u2022 addition of visual interface elements e.g. new buttons (switch id. 8)\n\u2022 transformation of interface elements e.g. screen rotation (switch ids. 12, 13, 14, 15)\n\u2022 transitions between different task paradigms e.g. SR to MTS tasks and vice-versa (switch ids. 4, 5, 9, 10).\nController hyperparameters were optimized in a cross-validated fashion (see Appendix G.1), and optimizations for three different initialization seeds were run to obtain mean and standard error.\n[Figure 8 plots. Panels a, b: Reward curves (legend: EMS, Single-Unit Voting, Layer Voting); panels c, d: Relative AUC Gain per switch for the EMS and None (Large) modules (legend: Single-Unit Voting, Layer Voting); panel e: Transfer Gain per switch]\nSwitch | Base Task | Switch Task\n1 | 2-way SR | 2-way SR new classes\n2 | 2-way SR | 4-way double-binary SR\n3 | 2-way stationary MTS | 2-way stationary MTS new classes\n4 | 2-way SR | 2-way stationary MTS\n5 | 2-way stationary MTS | 2-way SR\n6 | 2-way stationary MTS | 2-way vert-motion horiz-flip MTS\n7 | 2-way vert-motion horiz-flip MTS | 4-way 2-shown vert-motion MTS\n8 | 4-way 2-shown vert-motion MTS | 4-way 4-shown permuted MTS\n9 | 4-way double-binary SR | 4-way 4-shown stationary MTS\n10 | 4-way 4-shown stationary MTS | 4-way quadrant SR\n11 | 2-way SR | 4-way quadrant SR\n12 | 4-way double-binary SR | 4-way quadrant SR\n13 | 2-way SR | class reversal\n14 | 2-way SR | squeezed map\n15 | 2-way SR | 90\u25e6 map rotation\nFigure 8: Task switching with Dynamic Neural Voting. Post-switching learning curves for the EMS module on the 4-way Quadrant SR task after learning a. a 2-way SR task and b. a 4-way MTS task with 4 match screen class templates. Both the Layer Voting method and Single-Unit Voting method are compared against a baseline module trained on the second task from scratch. Across all twelve task switches, we evaluate the Relative Gain in AUC over baseline (RGain) using both voting methods for c. the EMS module and d. the large-sized fully-ablated late bottleneck MLP. e. Transfer Gain (TGain) metrics are compared for both module types for each of the voting mechanisms. Colors are as in c. (EMS module) and d. (fully-ablated module).\n
Figures 8a and b show characteristic post-switch learning curves for the EMS module for both the Layer Voting and Single-Unit Voting methods. Additional switching curves can be found in the Supplement. Cumulative reward gains relative to learning from scratch were quantified by Relative Gain in AUC: RGain = (AUC(M^switch) \u2212 AUC(M)) / AUC(M), where M is the module trained from scratch on the second task, and M^switch is the module transferred from an initial task using the dynamic voting controller. We find that the dynamic voting controller allows for rapid positive transfer of both module types across all 15 task switches, and the general Single-Unit voting method is a somewhat better transfer mechanism than the Layer Voting method (Fig. 8c). Both the EMS module and the large fully-ablated module, which was shown to be inefficient on single-task performance in \u00a7 3.2, benefit from dynamic neural voting (Fig. 
8d).\nEMS modules are more \u201cswitchable\u201d: To quantify how fast switching gains are realized, we use Transfer Gain: TGain = \u2206max / T_\u2206max, where T_\u2206max = argmax_t(\u2206_t) is the time at which the maximum reward difference between M^switch and M occurs, and \u2206max is the reward difference at that time. Qualitatively, a high score on the Transfer Gain metric indicates that a large amount of relative reward improvement has been achieved in a short amount of time (see Figure S7 for a graphical illustration of the relationship between the RGain and TGain metrics). While both the EMS and large fully-ablated modules have positive Transfer Gain, EMS scores significantly higher on this metric, i.e. is significantly more \u201cswitchable\u201d than the large fully-ablated module (Fig. 8e). We hypothesize that this is due to the EMS module being able to achieve high task performance with significantly fewer units than the larger fully-ablated module, making the former easier for the dynamic neural voting controller to operate on.\n5 CONCLUSION AND FUTURE DIRECTIONS\nIn this work, we introduce the TouchStream environment, a continual reinforcement learning framework that unifies a wide variety of spatial decision-making tasks within a single context. We describe a general algorithm (ReMaP) for learning light-weight neural modules that discover implicit task interfaces within this large-action/state-space environment. We show that a particular module architecture (EMS) is able to remain compact while retaining high task performance, and thus is especially suitable for flexible task learning and switching. We also describe a simple but general dynamic task-switching architecture that shows substantial ability to transfer knowledge when modules for new tasks are learned.\nA crucial future direction will be to expand insights from the current work into a more complete continual-learning agent. 
We will need to show that our approach scales to handle dozens or hundreds of task switches in sequence. We will also need to address issues of how the agent determines when to build a new module and how to consolidate modules when appropriate (e.g. when a series of tasks previously understood as separate can be solved by a single smaller structure). It will also be critical to extend our approach to handle visual tasks with longer horizons, such as navigation or game play with extended strategic planning, which will likely require the use of recurrent memory stores as part of the feature encoder.\nFrom an application point of view, we are particularly interested in using techniques like those described here to produce agents that can autonomously discover and operate the interfaces present in many important real-world two-dimensional problem domains, such as on smartphones or the internet (Grossman, 2007). We also expect that many of the same spatially-informed techniques that enable our ReMaP/EMS modules to perform well in the 2-D TouchStream environment will transfer naturally to a three-dimensional context, where autonomous robotics applications (Devin et al., 2016) are very compelling.\nSUPPLEMENTARY MATERIAL\nA TASK VARIANTS\nThe EMS module and all ablation controls were evaluated on a suite of 13 stimulus-response, match-to-sample, localization, and MS-COCO MTS variants:\n1. 2-way SR - standard binary SR task\n2. 4-way double binary SR - four class variant of SR, where each class is assigned either to the right or left half of the action space\n3. 4-way quadrant SR - four class variant of SR, where each class is assigned to only a quadrant of the action space\n4. 2-way stationary MTS - standard binary MTS task with stereotyped and non-moving match screens\n5. 2-way stationary horiz-flip MTS - two class variant MTS task where the match templates\u2019 horizontal placement is randomly chosen, but confined within the same vertical plane\n6. 2-way stationary vert-motion MTS - two class variant MTS task where the match templates\u2019 vertical position is randomly chosen, but each class is confined to a specific side\n7. 2-way stationary vert-motion horiz-flip MTS - two class variant MTS task where the match templates\u2019 positions are completely random\n8. 4-way 2-shown MTS - four class variant MTS task where only two class templates are shown on the match screen (appearing with random horizontal location as well)\n9. 4-way 2-shown vert-motion MTS - same as above, but with random vertical motion for the templates\n10. 4-way 4-shown stationary MTS - four class variant MTS task where all four class templates are shown on the match screen, but with fixed positions\n11. 4-way 4-shown permuted MTS - same as above, but with randomly permuted locations of all match templates\n12. Localization - Localization task\n13. MS-COCO MTS - 80-way MTS task using the MS-COCO detection challenge dataset, where match screens are randomly sampled scenes from the dataset\n
B EXPERIMENT DETAILS AND DATASETS\nStimulus-Response Experiment Details: Image categories used are drawn from the Image-Net 2012 ILSVRC classification challenge dataset (Deng et al., 2009). Four unique object classes are taken from the dataset: Boston Terrier, Monarch Butterfly, Race Car, and Panda Bear. Each class has 1300 unique training instances, and 50 unique validation instances.\nMatch-To-Sample Experiment Details: Sample screen images are drawn from the same Image-Net class set as the Stimulus-Response tasks. One face-centered, unobstructed class instance is also drawn from the Image-Net classification challenge set and used as a match screen template image for that class. Class template images for the match screen were held fixed at 100x100 pixels. For all variants of the MTS task, we keep a six pixel buffer between the edges of the screen and the match images, and a twelve pixel buffer between the adjacent edges of the match images themselves. 
Variants without vertical motion have the match images vertically centered on the screen.\nLocalization Experiment Details: The Localization task uses synthetic images containing a single main salient object placed on a complex background (similar to images used in Yamins & DiCarlo (2016); Yamins et al. (2014)). There are a total of 59 unique classes in this dataset. In contrast to other single-class localization datasets (e.g. Image-Net) which are designed to have one large, face-centered, and centrally-focused object instance and for which a trivial policy of \u201calways poke in image corners\u201d could be learned, this synthetic image set offers larger variance in instance scale, position, and rotation so the agent is forced into learning non-trivial policies requiring greater precision in action selection.\nMS-COCO MTS Experiment Details: This task uses the entire MS-COCO detection challenge dataset (Lin et al., 2014). On every timestep, a sample screen is chosen from one of the 80 MS-COCO classes. These are constructed to be large, unobstructed, face-centered representations of the class. For the match screen, we sample a random scene from MS-COCO containing any number of objects, but containing at least a single instance of the sample class. The agent is rewarded if its action is located inside any instance of the correct class. Both modules sample actions from a low-temperature Boltzmann policy from eq. (4), which was empirically found to result in more precise reward map prediction.\nC MODULES\nC.1 UNITS PER LAYER\nTable S1 aggregates the number of units per layer for the EMS and ablated modules that were used when conducting single-task and task-switching experiments. Only fully-connected modules\u2019 layer sizes are shown here. 
For details on the convolutional bottleneck EMS module, please refer to C.2.\nTable S1: Number of units per layer for investigated modules\nBase-task | EMS | No symm | No Mult | No mult/symm | NoneSmall | NoneMed | NoneLarge\nSR | 8 | 8 | 8 | 8 | 8 | 128 | 512\nMTS | 32 | 32 | 32 | 32 | 32 | 128 | 512\nLOC | 128 | 128 | 128 | 128 | 128 | 512 | 1024\n[Figure S1 heatmap. Vertical axis: the twelve SR, MTS, and Localization task variants; horizontal axis: the EMS module and its ablation control modules; color scale: Normalized Validation AUC (0.3 to 1.0)]\nFigure S1: Exhaustive module performance study of the EMS module and 23 ablation control modules, measured as the Area Under the Curve for all SR, MTS, and LOC task variants. Shown is the AUC normalized to the highest performing module in a task. Results in Fig. 4 have further averaged this over the vertical task axis, and report only a salient subset of the ablations.\nC.2 THE CONVOLUTIONAL-EMS MODULE\nThis is a \"Convolutional Bottleneck\" extension of the EMS module shown in the paper, where skip connections link the conv5 and the FC6 representation of the visual backbone. Here, the \"scene-level\" representation stored in the FC6 ReMaP memory buffer is tiled spatially to match the present convolution dimensions (here 14x14), and concatenated onto its channel dimension. 
A series of 1x1 convolutions plays the role of a shallow visual bottleneck, before the activations are vectorized and concatenated with A as input to the CReS layers of the standard EMS module. The results in the paper are shown for a bottleneck consisting of a single tanh and two CReS convolutions, with 128 units each. The downstream layers use 128 units each as well.\nThe motivation for the convolutional bottleneck is that lower-level features are useful for complex spatial tasks such as Localization and Object Detection, and hence may result in a more precise policy. By tiling the entire scene-level representation along the convolution layer\u2019s channel dimension, a form of multiplicative template-matching is possible between objects that must be memorized (e.g. MS-COCO MTS templates) and what is inside the present scene.\nD EXHAUSTIVE ABLATION STUDY\nIn all, we investigated 23 distinct ablations on the EMS module, across all twelve task variants outlined in Sec. A (Fig. S1). Symmetry ablations replace CReS with the activation x \u21a6 ReLU(x) \u2295 x^2. Multiplicative ablations are denoted by specifying the nonlinearity used in place of CReS (where this is one of ReLU, tanh, sigmoid, eLU (Clevert et al., 2015), or CReLU (Shang et al., 2016)). 
This additionally includes one partial symmetry ablation (denoted \u201cpartial symm\u201d) where only the visual bottleneck is symmetric, and one which ablates the ReLU from the \u201cno symm\u201d module (denoted \u201cno symm/partial-mult\u201d).\nTable S2: Module learning rates\n2-way SR\n4-way double binary SR\n4-way stationary SR\n2-way stationary MTS\n2-way vertmotion MTS\n2-way horiz flip MTS 2-way motion/flip MTS\n4-way 2-shown MTS 4-way 2-shown vertmotion MTS 4-way 4-shown stationary MTS 4-way 4-shown permuted MTS LOC\nEMS 10\u22123 10\u22123 10\u22123 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 10\u22124 Partial symm 10\u22123 10\u22123 10\u22123 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 5\u00b710\u22124 10\u22124 No symm 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 2\u00b710\u22124 10\u22124 No symm/partial mult 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 2\u00b710\u22124 10\u22124 No mult/symm ReLU 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 No mult/symm tanh 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 No mult/symm sig 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 No mult/symm eLU 10\u22123 10\u22123 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 No mult/symm CReLU 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 None ReLU(small) 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 
10\u22123 10\u22124 None ReLU(medium) 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 None ReLU(large) 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 None tanh(small) 10\u22124 10\u22124 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 None tanh(medium) 10\u22124 10\u22124 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 None tanh(large) 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 None sig(small) 10\u22124 10\u22124 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22123 10\u22124 None sig(medium) 10\u22124 10\u22124 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 None sig(large) 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 None eLU(small) 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 None eLU(medium) 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22123 10\u22124 10\u22124 None eLU(large) 10\u22124 10\u22124 10\u22123 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 None CReLU(small) 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 10\u22123 10\u22123 10\u22123 10\u22123 10\u22123 10\u22124 None CReLU(medium) 10\u22123 10\u22123 10\u22123 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22123 10\u22124 10\u22124 None CReLU(large) 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 10\u22124 D.1 HYPERPARAMETERS\nLearning rates for the ADAM optimizer 
were chosen on a per-task basis through cross-validation on a grid between [10^-4, 10^-3] for each architecture. Values used in the present study may be seen in Table S2.\nE ADDITIONAL MS-COCO REWARD MAPS\nFive additional reward map examples for the MS-COCO MTS task are provided in Figure S2. Examples are plotted over the course of learning.\nF ADDITIONAL LEARNING CURVES\nF.1 SINGLE-TASK ABLATION EXPERIMENTS\nLearning trajectories for seven additional tasks are provided in Figure S3. Modules capable of convergence on a task were run until this was achieved, but AUC values for a given task are calculated at the point in time when the majority of models converge.\nF.2 DYNAMIC VOTING CONTROLLER AND EMS MODULE TASK-SWITCHING EXPERIMENTS\nAdditional trajectories for ten unshown switching curves are provided in Figure S4.\n[Figure S2 image grid. Columns: Sample, Match, Reward Maps; horizontal axis: Training episode (in thousands)]\nFigure S2: Examples of the emergence of decision interfaces in MS-COCO MTS. Reward map predictions over the course of training for 5 different object classes.\n[Figure S3 plots. Panels: 2-way SR; 4-way double binary SR; 2-way stationary MTS; 2-way horiz-flip MTS; 2-way vert-motion MTS; 4-way 2-shown MTS; 4-way 4-shown stationary MTS. Legend: EMS, No symm, No mult, No mult/symm, None (small), None (medium), None (large). x-axis: Training Episodes; y-axis: Reward]\nFigure S3: Additional single-task performance ablation learning curves. Seven learning curves shown for task variants not seen in the main text body. Shown are the same ablations as the main text.\n
[Figure S4 plots. Panels: 2-way SR to 2-way SR new classes; 2-way SR to 4-way double binary SR; 2-way stationary MTS to 2-way stationary MTS new classes; 2-way SR to 2-way stationary MTS; 2-way stationary MTS to 2-way SR; 2-way stationary MTS to 2-way vert-motion horiz-flip MTS; 4-way 2-shown vert-motion MTS to 4-way 4-shown permuted MTS; 2-way vert-motion horiz-flip MTS to 4-way 2-shown vert-motion MTS; 4-way double binary SR to 4-way 4-shown stationary MTS; 4-way double binary SR to 4-way quadrant SR. Legend: EMS, Single-Unit Voting, Layer Voting. x-axis: Training Episodes; y-axis: Reward]\nFigure S4: Additional switching curves. Ten additional learning curves for unshown task switches in main text. Shown are both the Single-Unit and Layer Voting implementations of the dynamic voting controller with the EMS module. \u201cEMS\u201d denotes a module trained on the second task from scratch.\nG DYNAMIC VOTING CONTROLLER AUGMENTATIONS\nG.1 LEARNABLE PARAMETER INITIALIZATIONS\nHere we describe the weight initialization scheme that was found to be optimal for use with the dynamic voting controller. For simplicity, consider the layer-voting mechanism, with learnable weight matrices W^i and biases b^i. The intended biasing scheme is achieved through initializing the elements of these parameters to:\nW^i_\u03c9 \u223c |N(\u00b5_0, 0.001)| if i = 1, \u03c9 < \u2126\nW^i_\u03c9 \u223c |N(\u00b5_1, 0.001)| if i > 1, \u03c9 < \u2126\nW^i_\u03c9 \u223c |N(0.01, 0.001)| if \u03c9 = \u2126 (10)\nb^i_\u03c9 = b_0 if i = 1, \u03c9 < \u2126\nb^i_\u03c9 = b_1 if i > 1, \u03c9 < \u2126\nb^i_\u03c9 = 0.1 if \u03c9 = \u2126 (11)\nThis initialization technique was also generalized for use with the single-unit voting mechanism.\nFor the switching experiments presented in \u00a7 4.2, we sweep the hyperparameters on a narrow band around the default scheme. The ranges for these are: \u00b5_0 \u2208 [0.01, 0.005], b_0 \u2208 [0.1, 0.01], \u00b5_1 \u2208 [0.01, 0.02], and b_1 \u2208 [0.1, 0.2, 0.5, 1.0].\nG.2 TARGETED TRANSFORMATIONS\nTwo additional switching mechanisms were added to the controller to augment its ability to switch between tasks which are remappings of the action space or reward policy of a preexisting module. 
G.2.1 ACTION TRANSFORMATIONS\nWe note that efficient modules are those which can effectively produce a minimal representation of the interaction between action space A and observation x_t. If the agent\u2019s optimal action space shifts to A\u2032 while the remainder of the task context remains fixed, the controller should allow for rapid targeted remapping A \u21a6 A\u2032. Since we formulate the modules as ReMaP Networks, and A is an input feature basis, we can achieve remappings of this form through a fully-connected transformation:\na\u2032_\u03c4 = f(W_a a_\u03c4 + b) (12)\nwhere a_\u03c4 = [h_{t\u2212k_b:t\u22121}, a_t] is the vector of action histories, and W_a and b embed a_\u03c4 into the new action space A\u2032 using only a small number of learnable parameters.\nPseudo Identity-Preserving Transformation: In practice, we initialize the parameters in eq. (12) such that the transformation is pseudo identity-preserving, meaning that the representation learned at this level in the original module is not destroyed prior to transfer.\nThis is done by initializing W_a to be an identity matrix I_{|a_\u03c4|} with a small amount of Gaussian noise \u223c N(0.0, \u03c3^2) added to break symmetry. b is initialized to be a vector of ones of size |a_\u03c4|.\nG.2.2 REWARD MAP TRANSFORMATIONS\nEach of the k_f maps m_t(x) reflects the agent\u2019s uncertainty in the environment\u2019s reward policy. If the task context remains stationary, but the environment transitions to a new reward schedule R\u2032 that no longer aligns with the module\u2019s policy \u03c0, the controller could adapt to this transition, e.g. by containing a mechanism allowing for targeted transformation of m(x) and hence also \u03c0.\nOne complication that arises under ReMaP is that since each task-module learns its optimal action space internally, m(x) are in the basis of R rather than A. Therefore, transformations on the map distribution must also re-encode A before mapping to R\u2032. 
In this work, we investigate a shallow \u201cadapter\u201d neural network that lives on top of the existing module and maps R \u21a6 R\u2032. Its first and second layers are defined by\nl_1(x) = f(W_1[m(x) \u2299 g(a_\u03c4), a_\u03c4] + b_1) (13)\nm(x)\u2032 \u221d W_2 l_1 + b_2 (14)\nwhere g(a_\u03c4) is a similar transformation on A as above, \u2299 denotes elementwise multiplication, W_1 is a learnable matrix embedding into a hidden state, and W_2 \u2208 R^{|l_1|\u00d7|R\u2032|} is a learnable matrix embedding into R\u2032.\nPseudo Identity-Preserving Transformation: Similar to the transformation on the action space, we modify the reward-map transformation to be pseudo identity-preserving as well. This is done by modifying eq. (13) such that the original maps are concatenated on to the beginning of the transformation input vector:\nl_1(x) = f(W_1[m(x), m(x) \u2299 g(a_\u03c4), a_\u03c4] + b_1) (15)\nThe intended map-preserving transformation is accomplished via initializing W_1 and W_2 as:\nW^{(i,j)} \u223c 1.0 + N(0.0, \u03b5) if i = j, i < R\nW^{(i,j)} \u223c N(0.0, \u03b5) otherwise (16)\nG.3 TARGETED TRANSFORMATION HYPERPARAMETERS\nBoth of the targeted transformations have several hyperparameters. We conducted a grid search to optimize these in a cross-validated fashion, on a set of test task switches designed to be solved by one of the targeted transformations (Fig. S5). Each was conducted independently of the dynamic voting controller, and independently of the other transformation. 
Optimal hyperparameters found in these experiments were fixed for use in the integrated dynamic voting controller, and were not further optimized afterwards.\nAction Transformation Hyperparameters: We conducted three tests using the stimulus-response paradigm: class reversal (in which the left class becomes the right class and vice-versa), a horizontal rotation of the reward boundaries (such that right becomes up and left becomes down), and a \u201cswitch\u201d to the original task (intended to test the identity-preserving component).\nIn this work, we find that a single, non-activated linear transformation (f in eq. (12)) is optimal for this new state-space embedding, using 2 \u00b7 k_b units, and initialized such that the identity-preserving transformation weights have \u03c3 = 0.01. The learning rate for this transformation was found to be optimal at 0.1.\nReward Map Transformation Hyperparameters: We conducted two tests using the stimulus-response paradigm: a \u201csqueezing\u201d task (where there is no longer any reward dispensed on the lower half of the screen), and a \u201cswitch\u201d to the original task (intended to test the identity-preserving component).\nIn this work, we find the optimal activations in eq. (15) to be f(\u00b7) = CReS and g(\u00b7) = ReLU, with 4 units in the hidden layer. The noise parameter in the weight initialization scheme of eq. (16) was found optimal at 0.001, with an initial bias of 0.01. The optimal learning rate for this transformation was found to be 0.01.\nG.4 TRANSFORM ABLATION\nA study was conducted to determine the relative benefit of the targeted transformations (Fig. S6), where it was determined that the primary contribution of the dynamic neural controller was in fact the voting mechanism (although the transformations did supplement this as well).\nG.5 DEPLOYMENT SCHEME OF TASK MODULES\nWhen cued into task transition, the controller freezes the learnable parameters of the old task-module, and deploys a new uninitialized task-module. 
The controller then initializes the action and reward map transformation networks as described in G.2 on top of the old module. These transformations are also voted on inside the dynamic neural controller at every timestep.\nH SWITCHING METRICS\nFigure S7 graphically illustrates the metrics used inside the paper to quantify switching performance: RGain and TGain.\n[Figure S5 panels. a: Stimulus-Response, Environment Reward Policy Reversal; b: Stimulus-Response, Environment Reward Policy Rotation; c: Stimulus-Response, Environment Reward Policy Squeeze. Each panel shows the image, the reward map, and the task switch over Training Trials]\nFigure S5: Action and reward map transformation switch examples. Three task switching experiments were performed to optimize the hyperparameters of the targeted transformations that augment the dynamic neural voting controller. These switches are also retested in the fully-integrated meta-controller and shown in the original switching result figure. a. Binary stimulus-response class reversals, where the left class becomes the right class, and vice-versa. b. Rotations of the binary stimulus-response reward boundaries. c. A \u201csqueezing\u201d of the binary stimulus-response reward boundaries, where no reward is given on the new task on the bottom half of the screen, regardless of class shown.\n[Figure S6 plot. x-axis: switch id (1 to 15); y-axis: Relative AUC Gain (\u22120.1 to 0.5)]\nFigure S6: Targeted controller transform ablation. Relative AUC Gain for the EMS module over the same switching scenarios in the paper, but with the targeted transformations ablated.\n[Figure S7 plot. x-axis: Training episodes; y-axis: Reward; annotations: \u0394max, t\u0394max, TGain, RGain]\nFigure S7: Illustration of switching performance metrics. 
We quantify the switching performance of the dynamic neural controller and task-modules by two metrics: \u201crelative gain in AUC\u201d (ratio of green to purple shaded regions), and \u201ctransfer\u201d gain (difference of reward at T\u2206max). Relative AUC measures the overall gain relative to scratch, and the transfer gain measures the speed of transfer. Curve shown is the EMS module with Single-Unit voting method evaluated on a switch from a 4-way MTS task with two randomly moving class templates to a 4-way MTS task with four randomly moving templates.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Perturbative black box variational inference , Authors: Robert Bamler Cheng Zhang... , 2017\n2.  Title: Nice: Non-linear independent components estimation , Authors: Laurent Dinh David Krueger... , 2014\n3.  Title: Adversarial message passing for graphical models , Authors: Theofanis Karaletsos , 2016\n4.  Title: Improving variational inference with inverse autoregressive flow , Authors: Diederik P Kingma Tim Salimans... , 2016\n5.  Title: Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks , Authors: Lars Mescheder Sebastian Nowozin... , 2017\n6.  Title: Masked autoregressive flow for density estimation , Authors: George Papamakarios Iain Murray... , 2017\n7.  Title: On the opportunities and pitfalls of nesting monte carlo estimators , Authors: Tom Rainforth Robert Cornish... , 2017\n8.  Title: Tighter variational bounds are not necessarily better , Authors: Tom Rainforth Tuan Anh Le... , 2017\n9.  Title: Markov chain monte carlo and variational inference: Bridging the gap , Authors: Tim Salimans Diederik Kingma... , 2015\n10.  Title: A collapsed variational bayesian inference algorithm for latent dirichlet allocation , Authors: Yee W Teh David Newman... , 2007\n11.  Title: Deep and hierarchical implicit models , Authors: Dustin Tran Rajesh Ranganath... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["The importance-weighted autoencoder (IWAE) approach of Burda et al. (2015) defines a sequence of increasingly tighter bounds on the marginal likelihood of latent variable models. Recently, Cremer et al. 
(2017) reinterpreted the IWAE bounds as ordinary variational evidence lower bounds (ELBO) applied to increasingly accurate variational distributions. In this work, we provide yet another perspective on the IWAE bounds. We interpret each IWAE bound as a biased estimator of the true marginal likelihood where for the bound defined on K samples we show the bias to be of order $O(K^{-1})$. In our theoretical analysis of the IWAE objective we derive asymptotic bias and variance expressions. Based on this analysis we develop jackknife variational inference (JVI), a family of bias-reduced estimators reducing the bias to $O(K^{-(m+1)})$ for any given m < K while retaining computational efficiency. Finally, we demonstrate that JVI leads to improved evidence estimates in variational autoencoders. We also report first results on applying JVI to learning variational autoencoders.1", "The IWAE bound and other Monte Carlo objectives have been analyzed independently by Rainforth et al. (2017a). Their analysis is more general than our IWAE analysis, but does not propose a method to reduce bias. Delta-method variational inference (DVI) proposed by Teh et al. (2007) is the closest method we are aware of and we discuss it in detail as well as provide novel results in Appendix B, page 14. Another exciting recent work is perturbative variational inference (Bamler et al., 2017) which considers different objective functions for variational inference; we are not sure whether there exists a deeper relationship to debiasing schemes. There also exists a large body of work that uses the ELBO objective but considers ways to enlarge the variational family. This is useful because the larger the variational family, the smaller the bias. Non-linear but invertible transformations of reference densities have been used initially for density estimation in NICE (Dinh et al., 2014) and for variational inference in Hamiltonian variational inference (Salimans et al., 2015). 
Around the same time the general framework of normalizing flows (Rezende & Mohamed, 2015) unified the previous works as some invertible continuous transformation of a distribution. Since then a large number of specialized flows with different computational requirements and flexibility have been constructed: inverse autoregressive flows (Kingma et al., 2016), masked autoregressive flows Papamakarios et al. (2017), and Householder flows (Tomczak & Welling, 2016). Another way to improve the flexibility of the variational family has been to use implicit models (Mohamed & Lakshminarayanan, 2016) for variational inference; this line of work includes adversarial variational Bayes (Mescheder et al., 2017), wild variational inference (Li & Liu, 2016), deep implicit models (Tran et al., 2017), implicit variational models (Husza\u0301r, 2017), and adversarial message passing approximations (Karaletsos, 2016).", "The authors introduce jackknife variational inference (JVI), a method for debiasing Monte Carlo objectives such as the importance weighted auto-encoder. Starting by studying the bias of the IWAE bound for approximating log-marginal likelihood, the authors propose to make use of debiasing techniques to improve the approximation. For the binarized MNIST the authors show improved approximations given the same number of samples from the auxiliary distribution q(z|x).\n\nJVI seems to be an interesting extension of, and perspective on, the IWAE bound (and other Monte Carlo objectives). Some questions and comments:\n\n* The Cremer et al. (2017) paper contains some errors when interpreting the IWAE bound as a standard ELBO with a more flexible variational approximation distribution. For example eq. (1) in their paper does not correspond to an actual distribution, it is not properly normalized. This makes the connection in their section 2.1 unclear. 
I would suggest citing the following paper instead for this connection and the relation to importance sampling (IS):\nNaesseth, Linderman, Ranganath, Blei, \"Variational Sequential Monte Carlo\", 2017.\n\n* Regarding the analysis of the IWAE bound the paper by Rainforth et al. (2017) mentioned in the comments seems very relevant. Also, because of the strong connection between IWAE and IS detailed in the Naesseth et al. (2017) paper it is possible to make use of a standard Taylor approximation/delta methods to derive Prop. 1 and Prop. 2, see e.g. Robert & Casella, \"Monte Carlo Statistical Methods\" or Liu's \"Monte Carlo Strategies for Scientific Computing\".\n\n* It could be worth mentioning that the JVI objective function is now no longer (I think?) a lower bound to the log-evidence.\n\n* Could the surprising issue (IWAE-learned, JV1-evaluated being better than JV1-learned, JV1-evaluated) in Table 1 be because of different local optima?\n\n* Also, we can easily get unbiased estimates of the evidence p(x) using IS and optimize this objective wrt to model parameters. The proposal parameters can be optimized to minimize variance, how do you think this compares to the proposed method?\n\nMinor comments:\n* p(x) -> p_\\theta(x)\n* In the last paragraph of section 1 it seems like you claim that the expressiveness of p_\\theta(x|z) is a limitation of VAE. It was a bit unclear to me what was actually a general limitation of maximum likelihood versus the approximation based on VAEs.\n* Last paragraph of section 1, \"strong bound\" -> \"tight bound\"\n* Last paragraph of section 2, citation missing for DVI\n\n7: Good paper, accept"], "input": "1 INTRODUCTION\nVariational autoencoders (VAE) are a class of expressive probabilistic deep learning models useful for generative modeling, representation learning, and probabilistic regression. Originally proposed in Kingma & Welling (2013) and Rezende et al. 
(2014), VAEs consist of a probabilistic model as well as an approximate method for maximum likelihood estimation. In the generative case, the model is defined as\n$$p(x) = \\int p_\\theta(x|z)\\, p(z)\\, \\mathrm{d}z, \\tag{1}$$\nwhere z is a latent variable, typically a high dimensional vector; the corresponding prior distribution p(z) is fixed and typically defined as a standard multivariate Normal distribution $\\mathcal{N}(0, I)$. To achieve an expressive marginal distribution p(x), we define $p_\\theta(x|z)$ through a neural network, making the model (1) a deep probabilistic model.\nMaximum likelihood estimation of the parameters $\\theta$ in (1) is intractable, but Kingma & Welling (2013) and Rezende et al. (2014) propose to instead maximize the evidence lower-bound (ELBO),\n$$\\log p(x) \\ge \\mathbb{E}_{z \\sim q_\\omega(z|x)}\\left[ \\log \\frac{p_\\theta(x|z)\\, p(z)}{q_\\omega(z|x)} \\right] \\tag{2}$$\n$$=: \\mathcal{L}_E. \\tag{3}$$\nHere, $q_\\omega(z|x)$ is an auxiliary inference network, parametrized by $\\omega$. Simultaneous optimization of (2) over both $\\theta$ and $\\omega$ performs approximate maximum likelihood estimation in the model p(x) of (1) and forms the standard VAE estimation method.\n1 The implementation is available at https://github.com/Microsoft/jackknife-variational-inference\nIn practice $\\mathcal{L}_E$ is estimated using Monte Carlo: we draw K samples $z_i \\sim q_\\omega(z|x)$, then use the unbiased estimator $\\hat{\\mathcal{L}}_E$ of $\\mathcal{L}_E$,\n$$\\hat{\\mathcal{L}}_E = \\frac{1}{K} \\sum_{i=1}^{K} \\log \\frac{p_\\theta(x|z_i)\\, p(z_i)}{q_\\omega(z_i|x)}. \\tag{4}$$\nThe VAE approach is empirically very successful but are there fundamental limitations? One limitation is the quality of the model $p_\\theta(x|z)$: this model needs to be expressive enough to model the true distribution over x. Another limitation is that $\\mathcal{L}_E$ is only a lower-bound to the true likelihood. Is this bound tight? It can be shown, Kingma & Welling (2013), that when q(z|x) = p(z|x) we have $\\mathcal{L}_E = \\log p(x)$, hence (2) becomes exact. 
Therefore, we should attempt to choose an expressive class of distributions q(z|x) and indeed recent work has extensively investigated richer variational families. We discuss these methods in Section 7 but now review the importance weighted autoencoder (IWAE) method we build upon.\n2 BURDA\u2019S IMPORTANCE-WEIGHTED AUTOENCODER (IWAE) BOUND\nThe importance weighted autoencoder (IWAE) method of Burda et al. (2015) seemingly deviates from (2) in that they propose the IWAE objective, defined for an integer $K \\ge 1$,\n$$\\log p(x) \\ge \\mathbb{E}_{z_1,\\dots,z_K \\sim q_\\omega(z|x)}\\left[ \\log \\frac{1}{K} \\sum_{i=1}^{K} \\frac{p_\\theta(x|z_i)\\, p(z_i)}{q_\\omega(z_i|x)} \\right] \\tag{5}$$\n$$=: \\mathcal{L}_K. \\tag{6}$$\nWe denote with $\\hat{\\mathcal{L}}_K$ the empirical version which takes one sample $z_1, \\dots, z_K \\sim q_\\omega(z|x)$ and evaluates the inner expression in (5). We can see that $\\mathcal{L}_1 = \\mathcal{L}_E$, and indeed Burda et al. (2015) further show that\n$$\\mathcal{L}_E = \\mathcal{L}_1 \\le \\mathcal{L}_2 \\le \\cdots \\le \\log p(x), \\tag{7}$$\nand $\\lim_{K \\to \\infty} \\mathcal{L}_K = \\log p(x)$. These results are a strong motivation for the use of $\\mathcal{L}_K$ to estimate $\\theta$ and the IWAE method can often significantly improve over $\\mathcal{L}_E$. The bounds $\\mathcal{L}_K$ seem quite different from $\\mathcal{L}_E$, but recently Cremer et al. (2017) and Naesseth et al. (2017) showed that an exact correspondence exists: any $\\mathcal{L}_K$ can be converted into the standard form $\\mathcal{L}_E$ by defining a modified distribution $q_{IW}(z|x)$ through an importance sampling construction.\nWe now analyze the IWAE bound $\\hat{\\mathcal{L}}_K$ in more detail. Independently of our work, Rainforth et al. (2017a) have analysed nested Monte Carlo objectives, including the IWAE bound as a special case. Their analysis includes results equivalent to our Propositions 1 and 2.\n3 ANALYSIS OF THE IWAE BOUND\nWe now analyze the statistical properties of the IWAE estimator of the log-marginal likelihood. Basic consistency results have been shown in Burda et al. (2015); here we provide more precise results and add novel asymptotic results regarding the bias and variance of the IWAE method. 
Our results are given as expansions in the order K of the IWAE estimator but do involve moments $\\mu_i$ which are unknown to us. The jackknife method in the following sections will effectively circumvent the problem of not knowing these moments.\nProposition 1 (Expectation of $\\hat{\\mathcal{L}}_K$). Let P be a distribution supported on the positive real line and let P have finite moments of all orders. Let $K \\ge 1$ be an integer. Let $w_1, w_2, \\dots, w_K \\sim P$ independently. Then we have asymptotically, for $K \\to \\infty$,\n$$\\mathbb{E}[\\hat{\\mathcal{L}}_K] = \\mathbb{E}\\left[ \\log \\frac{1}{K} \\sum_{i=1}^{K} w_i \\right] = \\log \\mathbb{E}[w] - \\frac{1}{K} \\frac{\\mu_2}{2\\mu^2} + \\frac{1}{K^2} \\left( \\frac{\\mu_3}{3\\mu^3} - \\frac{3\\mu_2^2}{4\\mu^4} \\right) - \\frac{1}{K^3} \\left( \\frac{\\mu_4}{4\\mu^4} - \\frac{3\\mu_2^2}{4\\mu^4} - \\frac{10\\mu_3\\mu_2}{5\\mu^5} \\right) + o(K^{-3}), \\tag{8}$$\nwhere $\\mu_i := \\mathbb{E}_P[(w - \\mathbb{E}_P[w])^i]$ is the i\u2019th central moment of P and $\\mu := \\mathbb{E}_P[w]$ is the mean.\nProof. See Appendix A, page 12.\nThe above result directly gives the bias of the IWAE method as follows.\nCorollary 1 (Bias of $\\hat{\\mathcal{L}}_K$). If we see $\\hat{\\mathcal{L}}_K$ as an estimator of log p(x), then for $K \\to \\infty$ the bias of $\\hat{\\mathcal{L}}_K$ is\n$$\\mathbb{B}[\\hat{\\mathcal{L}}_K] = \\mathbb{E}[\\hat{\\mathcal{L}}_K] - \\log \\mathbb{E}[w] \\tag{9}$$\n$$= -\\frac{1}{K} \\frac{\\mu_2}{2\\mu^2} + \\frac{1}{K^2} \\left( \\frac{\\mu_3}{3\\mu^3} - \\frac{3\\mu_2^2}{4\\mu^4} \\right) - \\frac{1}{K^3} \\left( \\frac{\\mu_4}{4\\mu^4} - \\frac{3\\mu_2^2}{4\\mu^4} - \\frac{10\\mu_3\\mu_2}{5\\mu^5} \\right) + o(K^{-3}). \\tag{10}$$\nProof. The bias (10) follows directly by subtracting the true value $\\log p(x) = \\log \\mathbb{E}[w]$ from the right hand side of (8).\nThe above result shows that the bias is reduced at a rate of O(1/K). This is not surprising because the IWAE estimator is a smooth function applied to a sample mean. The coefficient of the leading O(1/K) bias term uses the ratio $\\mu_2/\\mu^2$, the variance divided by the squared mean of the P distribution. The quantity $\\sqrt{\\mu_2/\\mu^2}$ is known as the coefficient of variation and is a common measure of dispersion of a distribution. 
Hence, for large K the bias of $\\hat{\\mathcal{L}}_K$ is small when the coefficient of variation is small; this makes sense because when the dispersion is small the logarithm function behaves like a linear function and little bias results. The second-order and higher-order terms take into account higher order properties of P.\nThe bias is the key quantity we aim to reduce, but every estimator is also measured on its variance. We now quantify the variance of the IWAE estimator.\nProposition 2 (Variance of $\\hat{\\mathcal{L}}_K$). For $K \\to \\infty$, the variance of $\\hat{\\mathcal{L}}_K$ is given as follows.\n$$\\mathbb{V}[\\hat{\\mathcal{L}}_K] = \\frac{1}{K} \\frac{\\mu_2}{\\mu^2} - \\frac{1}{K^2} \\left( \\frac{\\mu_3}{\\mu^3} - \\frac{5\\mu_2^2}{2\\mu^4} \\right) + o(K^{-2}). \\tag{11}$$\nProof. See Appendix A, page 13.\nBoth the bias $\\mathbb{B}[\\hat{\\mathcal{L}}_K]$ and the variance $\\mathbb{V}[\\hat{\\mathcal{L}}_K]$ vanish for $K \\to \\infty$ at a rate of O(1/K) with similar coefficients. This leads to the following result which was already proven in Burda et al. (2015).\nCorollary 2 (Consistency of $\\hat{\\mathcal{L}}_K$). For $K \\to \\infty$ the estimator $\\hat{\\mathcal{L}}_K$ is consistent, that is, for all $\\epsilon > 0$\n$$\\lim_{K \\to \\infty} P(|\\hat{\\mathcal{L}}_K - \\log p(x)| \\ge \\epsilon) = 0. \\tag{12}$$\nProof. See Appendix A, page 13.\nHow good are the asymptotic results? This is hard to say in general because it depends on the particular distribution P(w) of the weights. In Figure 1 we show both a simple and challenging case to demonstrate the accuracy of the asymptotics.\nThe above results are reassuring evidence for the IWAE method, however, they cannot be directly applied in practice because we do not know the moments $\\mu_i$. One approach is to estimate the moments from data, and this is in fact what the delta method variational inference (DVI) method does, Teh et al. (2007), (see Appendix B, page 14); however, estimating moments accurately is difficult. We avoid the difficulty of estimating moments by use of the jackknife, a classic debiasing method. We now review this method. 
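As a quick sanity check of the bias expansion in Proposition 1, the following minimal sketch compares the exact bias of the log of a sample mean against the first- and second-order terms of the expansion. It is not part of the original paper: the two-point weight distribution (values a and b, probability p) and all function names are assumptions chosen only because the expectation can then be computed exactly by enumeration, without Monte Carlo noise.

```python
import math


def exact_expected_log_mean(K, a, b, p):
    # E[log((1/K) * sum_i w_i)] for i.i.d. w_i in {a, b} with P(w_i = a) = p,
    # computed exactly by enumerating how many of the K draws equal a.
    total = 0.0
    for j in range(K + 1):
        prob = math.comb(K, j) * p**j * (1.0 - p) ** (K - j)
        total += prob * math.log((j * a + (K - j) * b) / K)
    return total


def central_moment(order, a, b, p):
    # Central moment of the (hypothetical) two-point distribution.
    mean = p * a + (1.0 - p) * b
    return p * (a - mean) ** order + (1.0 - p) * (b - mean) ** order


def bias_terms(K, a, b, p):
    # First- and second-order bias terms of the expansion in Proposition 1.
    mu = p * a + (1.0 - p) * b
    mu2 = central_moment(2, a, b, p)
    mu3 = central_moment(3, a, b, p)
    first = -mu2 / (2.0 * mu**2) / K
    second = (mu3 / (3.0 * mu**3) - 3.0 * mu2**2 / (4.0 * mu**4)) / K**2
    return first, second


K, a, b, p = 50, 1.0, 3.0, 0.5
mu = p * a + (1.0 - p) * b
exact_bias = exact_expected_log_mean(K, a, b, p) - math.log(mu)
first, second = bias_terms(K, a, b, p)
print(exact_bias, first, first + second)
```

For this toy distribution the exact bias is negative, as expected for a lower bound, and adding the second-order term brings the approximation markedly closer to the exact value than the leading term alone.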
4 A BRIEF REVIEW OF THE JACKKNIFE\nWe now provide a brief review of the jackknife and generalized jackknife methodology. Our presentation deviates from standard textbook introductions, Miller (1974), in that we also review higher-order variants.\nThe jackknife methodology is a classic resampling technique originating with Quenouille (1949; 1956) in the 1950s. It is a generally applicable technique for estimating the bias $\\mathbb{B}[\\hat{T}] = \\mathbb{E}[\\hat{T}] - T$ and the variance $\\mathbb{V}[\\hat{T}]$ of an estimator $\\hat{T}$. Our focus is on estimating and correcting for bias.\nThe basic intuition is as follows: in many cases it is possible to write the expectation of a consistent estimator $\\hat{T}_n$ evaluated on n samples as an asymptotic expansion in the sample size n, that is, for large $n \\to \\infty$ we have\n$$\\mathbb{E}[\\hat{T}_n] = T + \\frac{a_1}{n} + \\frac{a_2}{n^2} + \\dots. \\tag{13}$$\nIn particular, this is possible in case the estimator is consistent and a smooth function of linear statistics. If an expansion (13) is possible, then we can take a linear combination of two estimators $\\hat{T}_n$ and $\\hat{T}_{n-1}$ to cancel the first order term,\n$$\\mathbb{E}[n\\hat{T}_n - (n-1)\\hat{T}_{n-1}] = n\\left( T + \\frac{a_1}{n} + \\frac{a_2}{n^2} \\right) - (n-1)\\left( T + \\frac{a_1}{n-1} + \\frac{a_2}{(n-1)^2} \\right) + O(n^{-2})$$\n$$= T + \\frac{a_2}{n} - \\frac{a_2}{n-1} + O(n^{-2}) \\tag{14}$$\n$$= T - \\frac{a_2}{n(n-1)} + O(n^{-2}) \\tag{15}$$\n$$= T + O(n^{-2}). \\tag{16}$$\nTherefore, the jackknife bias-corrected estimator $\\hat{T}_J := n\\hat{T}_n - (n-1)\\hat{T}_{n-1}$ achieves a reduced bias of $O(n^{-2})$. For $\\hat{T}_{n-1}$ any estimator which preserves the expectation (13) can be used. In practice we use the original sample of size n to create n subsets of size n \u2212 1 by removing each individual sample once. Then, the empirical average of n estimates $\\hat{T}_{n-1}^{\\setminus i}$, i = 1, . . . , n is used in place of $\\hat{T}_{n-1}$. 
In Sharot (1976) this construction was proved optimal in terms of maximally reducing the variance of $\\hat{T}_J$ for any given sample size n.\nIn principle, the above bias reduction (16) can be repeated to further reduce the bias to $O(n^{-3})$ and beyond. The possibility of this was already hinted at in Quenouille (1956) by means of an example.2 A fully general and satisfactory solution to higher-order bias removal was only achieved by the generalized jackknife of Schucany et al. (1971), considering estimators $\\hat{T}_G$ of order m, each having the form,\n$$\\hat{T}_G^{(m)} = \\sum_{j=0}^{m} c(n,m,j)\\, \\hat{T}_{n-j}. \\tag{17}$$\nThe coefficients c(n,m, j) in (17) are defined by the ratio of determinants of certain Vandermonde matrices, see Schucany et al. (1971). In a little known result, an analytic solution for c(n,m, j) is given by Sharot (1976). We call this form the Sharot coefficients, (Sharot, 1976, Equation (2.5) with r = 1), defined for m < n and $0 \\le j \\le m$,\n$$c(n,m,j) = (-1)^j \\frac{(n-j)^m}{(m-j)!\\, j!}. \\tag{18}$$\nThe generalized jackknife estimator $\\hat{T}_G^{(m)}$ achieves a bias of order $O(n^{-(m+1)})$, see Schucany et al. (1971). For example, the classic jackknife is recovered because c(n, 1, 0) = n and c(n, 1, 1) = \u2212(n \u2212 1). As an example of the second-order generalized jackknife we have\n$$c(n,2,0) = \\frac{n^2}{2}, \\quad c(n,2,1) = -(n-1)^2, \\quad c(n,2,2) = \\frac{(n-2)^2}{2}. \\tag{19}$$\nThe variance of generalized jackknife estimators is more difficult to characterize and may in general decrease or increase compared to $\\hat{T}_n$. Typically we have $\\mathbb{V}[\\hat{T}_G^{(m+1)}] > \\mathbb{V}[\\hat{T}_G^{(m)}]$ with asymptotic rates being the same.\nThe generalized jackknife is not the only method for debiasing estimators systematically. One classic method is the delta method for bias correction Small (2010). Two general methods for debiasing are the iterated bootstrap for bias correction (Hall, 2016, page 29) and the debiasing lemma McLeish (2010); Strathmann et al. 
(2015); Rhee & Glynn (2015). Remarkably, the debiasing lemma exactly debiases a large class of estimators.\nThe delta method bias correction has been applied to variational inference by Teh et al. (2007); we provide novel theoretical results for the method in Appendix B, page 14.\n2 Which was subtly wrong and did not reduce the bias to $O(n^{-2})$ as claimed, see Schucany et al. (1971).\n5 JACKKNIFE VARIATIONAL INFERENCE (JVI)\nWe now propose to apply the generalized jackknife for bias correction to variational inference by debiasing the IWAE estimator. The resulting estimator of the log-marginal likelihood will have significantly reduced bias, however, in contrast to the ELBO and IWAE, it is no longer a lower bound on the true log-marginal likelihood. Moreover, it can have increased variance compared to both IWAE and ELBO estimators. We will empirically demonstrate that the variance is comparable to the IWAE estimate and that the bias reduction is very effective in improving our estimates.\nDefinition 1 (Jackknife Variational Inference (JVI)). Let $K \\ge 1$ and m < K. The jackknife variational inference estimator of the evidence of order m with K samples is\n$$\\hat{\\mathcal{L}}_K^{J,m} := \\sum_{j=0}^{m} c(K,m,j)\\, \\bar{\\mathcal{L}}_{K-j}, \\tag{20}$$\nwhere $\\bar{\\mathcal{L}}_{K-j}$ is the empirical average of one or more IWAE estimates obtained from a subsample of size K \u2212 j, and c(K,m, j) are the Sharot coefficients defined in (18). In this paper we use all possible $\\binom{K}{K-j}$ subsets, that is,\n$$\\bar{\\mathcal{L}}_{K-j} := \\frac{1}{\\binom{K}{K-j}} \\sum_{i=1}^{\\binom{K}{K-j}} \\hat{\\mathcal{L}}_{K-j}(Z_i^{(K-j)}), \\tag{21}$$\nwhere $Z_i^{(K-j)}$ is the i\u2019th subset of size K \u2212 j among all $\\binom{K}{K-j}$ subsets from the original samples $Z = (z_1, z_2, \\dots, z_K)$. 
We further define $\\mathcal{L}_K^{J,m} = \\mathbb{E}_Z[\\hat{\\mathcal{L}}_K^{J,m}]$.\nFrom the above definition we can see that JVI strictly generalizes the IWAE bound and therefore also includes the standard ELBO objective: we have the IWAE case for $\\hat{\\mathcal{L}}_K^{J,0} = \\hat{\\mathcal{L}}_K$, and the ELBO case for $\\hat{\\mathcal{L}}_1^{J,0} = \\hat{\\mathcal{L}}_E$.\n5.1 ANALYSIS OF $\\hat{\\mathcal{L}}_K^{J,m}$\nThe proposed family of JVI estimators has less bias than the IWAE estimator. The following result is a consequence of the existing theory on the generalized jackknife bias correction.\nProposition 3 (Bias of $\\hat{\\mathcal{L}}_K^{J,m}$). For any $K \\ge 1$ and m < K we have that the bias of the JVI estimate satisfies\n$$\\mathbb{B}[\\hat{\\mathcal{L}}_K^{J,m}] = \\mathbb{E}[\\hat{\\mathcal{L}}_K^{J,m} - \\log p(x)] = \\mathcal{L}_K^{J,m} - \\log p(x) = O(K^{-(m+1)}). \\tag{22}$$\nProof. The JVI estimator $\\hat{\\mathcal{L}}_K^{J,m}$ is the application of the higher-order jackknife to the IWAE estimator which has an asymptotic expansion of the bias (10) in terms of orders of 1/K. The stated result is then a special case of (Schucany et al., 1971, Theorem 4.2).\nWe show an illustration of higher-order bias removal in Appendix C, page 15. It is more difficult to characterize the variance of $\\hat{\\mathcal{L}}_K^{J,m}$. Empirically we observe that $\\mathbb{V}[\\hat{\\mathcal{L}}_K^{J,m}] < \\mathbb{V}[\\hat{\\mathcal{L}}_K^{J,m'}]$ for $m < m'$, but we have been unable to derive a formal result to this end. Note that the variance is over the sampling distribution of q(z|x), so we can always reduce the variance by averaging multiple estimates $\\hat{\\mathcal{L}}_K^{J,m}$, whereas we cannot reduce bias this way. Therefore, reducing bias while increasing variance is a sensible tradeoff in our application.\n5.2 EFFICIENT COMPUTATION OF $\\hat{\\mathcal{L}}_K^{J,m}$\nWe now discuss how to efficiently compute (20). For typical applications, for example in variational autoencoders, we will use small values of K, say K < 100. However, even with K = 50 and m = 2 there are already 1276 IWAE estimates to compute in (20\u201321). Therefore efficient computation is important to consider. 
One property that helps us is that all these IWAE estimates are related because they are based on subsets of the same weights. The other property that is helpful is that computation of the K weights is typically orders of magnitude more expensive than elementary summation operations required for computation of (21).\nWe now give a general algorithm for computing the JVI estimator $\\hat{\\mathcal{L}}_K^{J,m}$, then give details for efficient implementation on modern GPUs and state complexity results.\nAlgorithm 1 computes log-weights and implements equations (20\u201321) in a numerically robust manner.3\nProposition 4 (Complexity of Algorithm 1). Given $K \\ge 1$ and $m \\le K/2$ the complexity of Algorithm 1 is\n$$O\\left( K e^m \\left( \\frac{K}{m} \\right)^m \\right). \\tag{23}$$\nProof. See Appendix C, page 15.\n3 As usual, the log-sum-exp operation needs to be numerically robustly implemented.\nThe above algorithm is suitable for CPU implementation; to utilize modern GPU hardware efficiently we can instead represent the second part of the algorithm using matrix operations. We provide further details in Appendix C, page 16. Figure 2 demonstrates experimental runtime evaluation on the MNIST test set for different JVI estimators. We show all JVI estimators with less than 5,000 total summation terms. The result demonstrates that runtime is largely independent of the order of the JVI correction and only depends linearly on K.\n5.3 VARIATIONS OF THE JVI ESTIMATOR\nVariations of the JVI estimator with improved runtime exist. Such reductions in runtime are possible if we consider evaluating only a fraction of all possible subsets in (21). When tractable, our choice of evaluating all subsets is generally preferable in terms of variance of the resulting estimator. 
However, to show that we can even reduce bias to order $O(K^{-K})$ at cost O(K) we consider the estimator\n$$\\hat{\\mathcal{L}}_K^{X} := \\sum_{j=0}^{K-1} c(K,K-1,j)\\, \\hat{\\mathcal{L}}_{K-j}(Z_{1:(K-j)}) \\tag{24}$$\n$$= c(K,K-1,K-1) \\log(\\exp(v_K)) \\tag{25}$$\n$$+ c(K,K-1,K-2) \\log\\left( \\frac{1}{2} (\\exp(v_{K-1}) + \\exp(v_K)) \\right) \\tag{26}$$\n$$+ \\cdots + c(K,K-1,0) \\log\\left( \\frac{1}{K} \\sum_{i=1}^{K} \\exp(v_i) \\right). \\tag{27}$$\nThe sum (25\u201327) can be computed in time O(K) by keeping a running partial sum $\\sum_{i=1}^{k} \\exp(v_i)$ for $k \\le K$ and by incrementally updating this sum4, meaning that (24) can be computed in O(K) overall. As a generalized jackknife estimate $\\hat{\\mathcal{L}}_K^{X}$ has bias $O(K^{-K})$. We do not recommend its use in practice because its variance is large, however, developing estimators between the two extremes of taking one set and taking all sets of subsets of a certain size seems a good way to achieve high-order bias reduction while controlling variance.\n4 To do this in a numerically stable manner, we need to use streaming log-sum-exp computations, see for example http://www.nowozin.net/sebastian/blog/streaming-log-sum-exp-computation.html\n6 EXPERIMENTS\nWe now empirically validate our key claims regarding the JVI method: 1. JVI produces better estimates of the marginal likelihood by reducing bias, even for small K; and 2. Higher-order bias reduction is more effective than lower-order bias reduction.\nTo this end we will use variational autoencoders trained on MNIST. Our setup is purposely identical to the setup of Tomczak & Welling (2016), where we use the dynamically binarized MNIST data set of Salakhutdinov & Murray (2008). Our numbers are therefore directly comparable to the numbers reported in the above works. Our implementation is available at https://github.com/Microsoft/jackknife-variational-inference.\nWe first evaluate the accuracy of evidence estimates given a fixed model. This setting is useful for assessing model performance and for model comparison. 
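For concreteness, the JVI estimator of Definition 1, which is the quantity evaluated in the experiments that follow, can be sketched directly from equations (18), (20), and (21). This is a minimal illustrative sketch and not the authors' Algorithm 1 or their released implementation: it enumerates all subsets naively with the standard library, and the function names are assumptions made for this example.

```python
import itertools
import math


def sharot_coefficient(n, m, j):
    # Equation (18): c(n, m, j) = (-1)^j (n - j)^m / ((m - j)! j!)
    return (-1) ** j * (n - j) ** m / (math.factorial(m - j) * math.factorial(j))


def iwae_estimate(log_weights):
    # log((1/k) * sum_i exp(v_i)), using the max-shift log-sum-exp trick
    # so that large or small log-weights do not overflow or underflow.
    v_max = max(log_weights)
    total = sum(math.exp(v - v_max) for v in log_weights)
    return v_max + math.log(total) - math.log(len(log_weights))


def jvi_estimate(log_weights, m):
    # Equations (20)-(21): Sharot-weighted sum of IWAE estimates, each
    # averaged over all subsets of size K - j, for j = 0, ..., m.
    K = len(log_weights)
    if not 0 <= m < K:
        raise ValueError('order m must satisfy 0 <= m < K')
    estimate = 0.0
    for j in range(m + 1):
        subsets = list(itertools.combinations(log_weights, K - j))
        mean_iwae = sum(iwae_estimate(s) for s in subsets) / len(subsets)
        estimate += sharot_coefficient(K, m, j) * mean_iwae
    return estimate
```

With m = 0 the sketch reduces to the plain IWAE estimate, and because the Sharot coefficients sum to one, identical log-weights are returned unchanged for any order. The naive enumeration is only practical for small K and m; the paper's own algorithm should be preferred for anything larger.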
6.1 JVI AS EVALUATION METHOD\nWe train a regular VAE on the dynamically binarized MNIST dataset using either the ELBO, IWAE, or JVI-1 objective functions. We use the same two-layer neural network architecture with 300 hidden units per layer as in (Tomczak & Welling, 2016). We train on the first 50,000 training images, using 10,000 images for validation. We train with SGD for 5,000 epochs and take as the final model the model with the maximum validation objective, evaluated after every training epoch. Hyperparameters are the batch size in {1024, 4096} and the SGD step size in {0.1, 0.05, 0.01, 0.005, 0.001}. The final model achieving the best validation score is evaluated once on the MNIST test set. All our models are implemented using Chainer (Tokui et al., 2015) and run on an NVidia Titan X.\nFor three separate models, trained using the ordinary ELBO, IWAE, and JVI-1 objectives, we then estimate the marginal log-likelihood (evidence) on the MNIST test set. For evaluation we use JVI estimators up to order five in order to demonstrate higher-order bias reduction. Among all possible JVI estimators up to order five we evaluate only those JVI estimators whose total sum of IWAE estimates has fewer than 5,000 terms. For example, we do not evaluate $\\hat{\\mathcal{L}}_{32}^{J,3}$ because it contains $\\binom{32}{0} + \\binom{32}{1} + \\binom{32}{2} + \\binom{32}{3} = 5489$ terms.5\nFigure 3 shows the evidence estimates for three models. We make the following observations, applying to all plots: 1. Noting the logarithmic x-axis we can see that higher-order JVI estimates are more than one order of magnitude more accurate than IWAE estimates. 2. The quality of the evidence estimates empirically improves monotonically with the order of the JVI estimator; 3. In absolute terms the improvement in evidence estimates is largest for small values of K, which is what is typically used in practice; 4. 
The higher-order JVI estimators remove low-order bias but significant higher-order bias remains even for K = 64, showing that on real VAE log-weights the contribution of higher-order bias to the evidence error is large; 5. The standard error of each test set marginal likelihood (shown as error bars, best visible in a zoomed version of the plot) is comparable across all JVI estimates; this empirically shows that higher-order bias reduction does not lead to high variance.\n5 We do this because we discovered numerical issues for large sums of varying size and found all summations of less than a few thousand terms not to have this problem but we are looking into a way to compute more summation terms in a fast and robust manner.\n6.2 JVI AS A TRAINING OBJECTIVE\nWe now report preliminary results on learning models using the JVI objectives. The setting is the same as in Section 6.1 and we report the average performance of five independent runs.\nTable 1 reports the results. We make the following observations: 1. When training on the IWAE and JVI-1 objectives, the respective score by the ELBO objective is impoverished and this effect makes sense in light of the work of Cremer et al. (2017). Interestingly the effect is stronger for JVI-1. 2. The model trained using the JVI-1 objective falls slightly behind the IWAE model, which is surprising because the evidence is clearly better approximated as demonstrated in Section 6.1. We are not sure what causes this issue, but have two hypotheses: First, in line with recent findings (Rainforth et al., 2017b) a tighter log-evidence approximation could lead to poor encoder models. In such case it is worth exploring two separate learning objectives for the encoder and decoder; for example, using an ELBO for training the encoder, and an IWAE or JVI-1 objective for training the decoder. 
Second, because JVI estimators are no longer bounds it could be the case that during optimization of the learning objective a decoder is systematically learned in order to amplify positive bias in the log-evidence.\n8 CONCLUSION\nIn summary we proposed to leverage classic higher-order bias removal schemes for evidence estimation. Our approach is simple to implement, computationally efficient, and clearly improves over existing evidence approximations based on variational inference. More generally our jackknife variational inference debiasing formula can also be used to debias log-evidence estimates coming from annealed importance sampling.\nHowever, one surprising finding from our work is that using our debiased estimates for training VAE models did not improve over the IWAE training objective; a priori, a better evidence estimate should allow for improved model learning.\nOne possible extension to our work is to study the use of other resampling methods for bias reduction; promising candidates are the iterated bootstrap, the Bayesian bootstrap, and the debiasing lemma. These methods could offer further improvements on bias reduction or reduced variance, however, the key challenge is to overcome computational requirements of these methods or, alternatively, to derive key quantities analytically.6 Application of the debiasing lemma in particular requires the careful construction of a truncation distribution and often produces estimators of high variance.\nWhile variance reduction plays a key role in certain areas of machine learning, our hope is that our work shows that bias reduction techniques are also widely applicable.\nAPPENDIX A: PROOFS FOR THE IWAE ANALYSIS\nEXPECTATION AND BIAS OF $\\hat{\\mathcal{L}}_K$\nProof. (Of Proposition 1, page 2) To show (8) we apply the delta method for moments (Small, 2010, Section 4.3). First, we define the random variable $Y_K := \\frac{1}{K} \\sum_{i=1}^{K} w_i$ corresponding to the sample mean of $w_1, \\dots, w_K$. 
Because of linearity of expectation we have E[YK ] = E[w]. We expand the logarithm function log YK = log(E[w] + (YK \u2212 E[w])) around E[w] to obtain\nlog(E[w] + (YK \u2212 E[w])) = logE[w]\u2212 \u221e\u2211 j=1 (\u22121)j j E[w]j (YK \u2212 E[w])j . (28)\nNote that only YK is random in (28), all other quantities are constant. Therefore, by taking the expectation on the left and right side of (28) we obtain\nE[log YK ] = logE[w]\u2212 \u221e\u2211 j=1 (\u22121)j j E[w]j E[(YK \u2212 E[w])j ]. (29)\nThe right hand side of (29) is expressed in terms of the central moments for i \u2265 2, \u03b3i := E[(YK \u2212 E[YK ])i] of YK , whereas we are interested in an expression using the central moments i \u2265 2, \u00b5i := E[(w \u2212 E[w])i] of P . With \u03b3 = \u00b5 = E[w] we denote the shared first non-central moment. Because YK is a sample mean we can use existing results that relate \u03b3i to \u00b5i. In particular (Angelova, 2012, Theorem 1) gives the relations\n\u03b3 = \u00b5 (30)\n\u03b32 = \u00b52 K\n(31)\n\u03b33 = \u00b53 K2\n(32)\n\u03b34 = 3\nK2 \u00b522 +\n1 K3 ( \u00b54 \u2212 3\u00b522 ) (33)\n\u03b35 = 10\nK3 \u00b53\u00b52 +\n1\nK4 (\u00b55 \u2212 10\u00b53\u00b52) . (34)\nExpanding (29) to order five and using the relations (30) to (34) gives\nE[log YK ] = logE[w]\u2212 1 2\u00b52 \u00b52 K + 1 3\u00b53 \u00b53 K2 \u2212 1 4\u00b54\n( 3\nK2 \u00b522 +\n1 K3 ( \u00b54 \u2212 3\u00b522 )) + 1\n5\u00b55\n( 10\nK3 \u00b53\u00b52 +\n1\nK4 (\u00b55 \u2212 10\u00b53\u00b52)\n) + o(K\u22123). (35)\nRegrouping the terms by order of K produces the result (8). VARIANCE OF L\u0302K\nProof. 
(Of Proposition 2, page 3) We use the definition of the variance and the series expansion of the logarithm function, obtaining\nV[log YK ] = E[(log YK \u2212 E[log YK ])2] (36)\n= E (log\u00b5\u2212 \u221e\u2211 i=1 (\u22121)i i\u00b5i (YK \u2212 \u00b5)i \u2212 log\u00b5+ \u221e\u2211 i=1 (\u22121)i i\u00b5i E[(YK \u2212 \u00b5)i] )2 = E\n( \u221e\u2211 i=1 (\u22121)i i\u00b5i ( E[(YK \u2212 \u00b5)i]\u2212 (YK \u2212 \u00b5)i ))2 . (37) By expanding (37) to third order and expanding all products we obtain a moment expansion of YK as follows.\nV[log YK ] \u2248 \u03b32 \u00b52 \u2212 1 \u00b53 (\u03b33 \u2212 \u03b31\u03b32) + 2 3\u00b54 (\u03b34 \u2212 \u03b31\u03b33) + 1 4\u00b54 (\u03b34 \u2212 \u03b322) (38)\n\u2212 1 3\u00b55 (\u03b35 \u2212 \u03b32\u03b33) + 1 9\u00b56 (\u03b36 \u2212 \u03b323). (39)\nBy substituting the sample moments \u03b3i of YK with the central moments \u00b5i of the original distribution P and simplifying we obtain\nV[log YK ] = 1\nK \u00b52 \u00b52 \u2212 1 K2 ( \u00b53 \u00b53 \u2212 5\u00b5 2 2 2\u00b54 ) + o(K\u22122). (40) CONSISTENCY OF L\u0302K\nProof. We have\nP (|L\u0302K \u2212 log p(x)| \u2265 ) = P (|L\u0302K \u2212 E[L\u0302K ] + E[L\u0302K ]\u2212 log p(x)| \u2265 ) (41) \u2264 P (|L\u0302K \u2212 E[L\u0302K ]|+ |E[L\u0302K ]\u2212 log p(x)| \u2265 ) (42)\n\u2264 P (|L\u0302K \u2212 E[L\u0302K ]| \u2265 2 ) + P (|E[L\u0302K ]\u2212 log p(x)| \u2265 2 ). (43)\nThe second term in (43) does not involve a random variable therefore is either zero or one. For large enough K it will always be zero due to (10).\nFor the first term in (43) we apply Chebyshev\u2019s inequality. We set \u03c4 = 2 \u221a V[L\u0302K ] and have\nP (|L\u0302K \u2212 E[L\u0302K ]| \u2265 2 ) = P (|L\u0302K \u2212 E[L\u0302K ]| \u2265 \u03c4\n\u221a V[L\u0302K ]) (44)\n\u2264 1 \u03c42\n(45)\n= 4 V[L\u0302K ] 2\n(46)\n= O(1/K). (47)\nThus, for K \u2192\u221e and any > 0 we have that (43) has a limit of zero. 
This establishes convergence in probability and hence consistency. APPENDIX B: ANALYSIS OF DELTA-METHOD VARIATIONAL INFERENCE (DVI)\nDefinition 2 (Delta method Variational Inference (DVI), (Teh et al., 2007)).\nLDK := Ez1,...,zK\u223cq\u03c9(z|x)\n[ log 1\nK K\u2211 i=1 wi + w\u03022 2Kw\u0302\n] , (48)\nwhere\nwi = p(x|zi) p(zi) q\u03c9(zi|w) , i = 1, . . . ,K, (49)\nw\u03022 := 1\nK \u2212 1 K\u2211 i=1 (wi \u2212 w\u0302)2, (50)\nw\u0302 := 1\nK K\u2211 i=1 wi, (51)\nso that w\u03022 corresponds to the sample variance and w\u0302 corresponds to the sample mean.\nThe practical Monte Carlo estimator of (48) is defined as follows. zi \u223c q\u03c9(z|x), i = 1, . . . ,K, (52)\nL\u0302DK := log 1\nK K\u2211 i=1 wi + w\u03022 2Kw\u0302 . (53) ANALYSIS OF DELTA METHOD VARIATIONAL INFERENCE\nProposition 5 (Bias of L\u0302DK). We evaluate the bias of L\u0302DK in (53) as follows.\nB[L\u0302DK ] = \u2212 1\nK2 ( \u00b53 \u00b53 \u2212 3\u00b5 2 2 2\u00b54 ) + o(K\u22122). (54)\nProof. Consider the function f(x, y) = xy2 and its second order Taylor expansion around (x, y) = (\u00b52, \u00b5),\nf(\u00b52 + (\u00b5\u03022 \u2212 \u00b52), \u00b5+ (\u00b5\u0302\u2212 \u00b5)) \u2248 \u00b52 \u00b52 + 1 \u00b52 (\u00b5\u03022 \u2212 \u00b52)\u2212 2\u00b52 \u00b53 (\u00b5\u0302\u2212 \u00b5) (55)\n\u2212 2 \u00b53 (\u00b5\u03022 \u2212 \u00b52)(\u00b5\u0302\u2212 \u00b5) + 6\u00b52 2\u00b54 (\u00b5\u0302\u2212 \u00b5)2. (56)\nTaking expectations on both sides cancels all linear terms and yields E [ \u00b5\u03022 \u00b5\u03022 ] \u2248 \u00b52 \u00b52 \u2212 2 \u00b53 E[(\u00b5\u03022 \u2212 \u00b52)(\u00b5\u0302\u2212 \u00b5)] + 3\u00b52 \u00b54 E[(\u00b5\u0302\u2212 \u00b5)2]. (57)\nBy classic results we have that the expected variance of the sample mean around the true mean is related to the variance by E[(\u00b5\u0302\u2212\u00b5)2] = \u00b52/K. 
Furthermore, Zhang (2007) showed a beautiful result about the covariance of sample mean and sample variance for arbitrary random variables, namely that\nCov[\u00b5\u03022, \u00b5\u0302] = E[(\u00b5\u03022 \u2212 \u00b52)(\u00b5\u0302\u2212 \u00b5)] = \u00b53/K. (58) Using both results in (57) produces\nE [ \u00b5\u03022 \u00b5\u03022 ] = \u00b52 \u00b52 \u2212 1 K ( 2\u00b53 \u00b53 \u2212 3\u00b5 2 2 \u00b54 ) + o(K\u22121). (59)\nWe can now decompose the expectation of L\u0302DK as follows.\nE[L\u0302DK ] = E\n[ log 1\nK K\u2211 i=1 wi\n] + E [ \u00b5\u03022\n2K\u00b5\u03022\n] (60)\n= logE[w]\u2212 \u00b52 2K\u00b52 + 1 2K ( \u00b52 \u00b52 \u2212 1 K ( 2\u00b53 \u00b53 \u2212 3\u00b5 2 2 \u00b54 ) + o(K\u22121) ) (61)\n= logE[w]\u2212 1 K2 ( \u00b53 \u00b53 \u2212 3\u00b5 2 2 2\u00b54 ) + o(K\u22122). (62)\nNotably, in (62) the 1/K term is cancelled exactly by the delta method correction, even though we used an empirical ratio estimator \u00b5\u03022/\u00b5\u03022. Subtracting the true mean log p(x) = logE[w] from (62) yields the bias (54) and completes the proof. EXPERIMENTAL COMPARISON OF DVI AND JVI\nWe perform the experiment shown in Figure 1 including the DVI estimator. The result is shown in Figure 4 and confirms that DVI reduces bias but that for the challenging case JVI is superior in terms of bias reduction. APPENDIX C: MORE JVI DETAILS\n COMPLEXITY PROOF\nProof. The first for loop of the algorithm has complexity O(K). The second part of the algorithm considers all subsets of size K, K \u2212 1, . . . , K \u2212m. In total these are S(K,m) = \u2211m j=0 ( K K\u2212j ) =\u2211m\nj=0 ( K j ) sets. Justin Melvin derived a bound on this partial binomial sum7, as\nS(K,m) \u2264 em ( K\nm\n)m . (63)\nFor each of the S(K,m) sets we have to perform at most K operations to compute the log-sum-exp operation, which yields the stated complexity bound. 
HIGHER-ORDER BIAS REMOVAL DEMONSTRATION\nWe illustrate the behaviour of the higher-order JVI estimators on the same P = Gamma(0.1, 1) example we used previously. Figure 5 demonstrates the increasing order of bias removal, O(K\u2212(m+1)) for the L\u0302J,mK estimators.\n7See https://mathoverflow.net/questions/17202/sum-of-the-first-k-binomial-coefficients-for-fixed-n.\nGPU IMPLEMENTATION OF JVI\nTo this end let K \u2265 1 and m < K be fixed and assume the log-weights vi are concatenated in one column vector of K elements. We then construct a matrix B of size (|S|,K), where S is the set of all subsets that will be considered,\nS = m\u22c3 j=0 EnumerateSubsets({1, . . . ,K},K \u2212 j). (64)\nThere are |S| rows in B and each row in B corresponds to a subset S \u2208 S of samples so that we can use S to index the rows in B. We set\nBS,i = 1\n|S| Ii\u2208S , (65)\nwhere Ipred is one if the predicate is true and zero otherwise. We furthermore construct a vector A with |S| elements. We set\nAS = c(K,m,K \u2212 |S|)/ ( K\nK \u2212 |S|\n) = (\u22121)K\u2212|S| |S|! |S| m\nK! (m\u2212K + |S|)! . (66)\nUsing these definitions we can express the estimator as A> log(B exp(v)), with the log and exp operations being elementwise. However, this is not numerically robust. Instead we can compute the estimator in the log domain as logsumexp2(IS\u00d71v> + logB)A, where logsumexp2 denotes a log-sum-exp operation along the second axis. This can be easily implemented in modern neural network frameworks and we plan to make our implementation available.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Sampling from a log-concave distribution with compact support with proximal langevin monte carlo , Authors: Nicolas Brosse Alain Durmus... , 2017\n2.  Title: Finite-time analysis of projected langevin monte carlo , Authors: Sebastien Bubeck Ronen Eldan... , 2015\n3.  Title: Kernel exponential family estimation via doubly dual embedding , Authors: Bo Dai Hanjun Dai... , 2018\n4.  Title: Optimal rates for zero-order convex optimization: The power of two function evaluations , Authors: John C Duchi Michael I Jordan... , 2015\n5.  Title: Mirrored langevin dynamics , Authors: Ya-Ping Hsieh Ali Kavis... , 2018\n6.  Title: Estimation of non-normalized statistical models by score matching , Authors: Aapo Hyv\u00e4rinen , 2005\n7.  Title: Non-asymptotic results for langevin monte carlo: Coordinate-wise and black-box sampling , Authors: Lingqing Shen Krishnakumar Balasubramanian... , 2019\n8.  Title: Under review as a conference paper at ICLR , Authors: Qiang Wu Justin Guinney... , 2020\n9.  Title: The proof here is similar to the proof of Lemma 8 in (Bubeck et al., 2015). By the data processing inequality we have: KL(\u03bcT , Authors:  , 2015\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["We consider the problem of generating configurations that satisfy physical constraints for optimal material nano-pattern design, where multiple (and often conflicting) properties need to be simultaneously satisfied. Consider, for example, the trade-off between thermal resistance, electrical conductivity, and mechanical stability needed to design a nano-porous template with optimal thermoelectric efficiency. 
To that end, we leverage the posterior regularization framework and show that this constraint satisfaction problem can be formulated as sampling from a Gibbs distribution. The main challenges come from the black-box nature of those physical constraints, since they are obtained via solving highly non-linear PDEs. To overcome those difficulties, we introduce Surrogate-based Constrained Langevin dynamics for black-box sampling. We explore two surrogate approaches. The first approach exploits zero-order approximation of gradients in the Langevin Sampling and we refer to it as Zero-Order Langevin. In practice, this approach can be prohibitive since we still need to often query the expensive PDE solvers. The second approach approximates the gradients in the Langevin dynamics with deep neural networks, allowing us an efficient sampling strategy using the surrogate model. We prove the convergence of those two approaches when the target distribution is log-concave and smooth. We show the effectiveness of both approaches in designing optimal nano-porous material configurations, where the goal is to produce nano-pattern templates with low thermal conductivity and reasonable mechanical stability.", "Zero-Order Methods. Zero-order optimization with Gaussian smoothing was studied in Nesterov & Spokoiny (2017) and Duchi et al. (2015) in the convex setting. Non-convex zero order optimization was also addressed in Ghadimi & Lan (2013). The closest to our work is the zero-order Langevin Shen et al. (2019) introduced recently for black-box sampling from log concave density. The main difference in our setting is that the density has a compact support and hence the need to appeal to projected LMC (Bubeck et al., 2015) and Proximal LMC (Brosse et al., 2017). It is worth noting that Hsieh et al. (2018) introduced recently mirror Langevin sampling that can also be leveraged in our framework. Gradients and Score functions Estimators. 
We used the approach of gradient distillation (Srinivas & Fleuret, 2018) and learning gradients of (Wu et al., 2010), since they are convenient for training on different constraints and they come with theoretical guarantees. However, other approaches can be also leveraged such as the score matching approach for learning the gradient of the log likelihood (Hyv\u00e4rinen, 2005) and other variants appealing to dual embeddings (Dai et al., 2018). Estimating gradients can be also performed using Stein\u2019s method as in (Li & Turner, 2017), or via maintaining a surrogate of the gradient as in Stein descent without gradient (Han & Liu, 2018). Optimization approaches. Due to space limitation, we restrict the discussion to the optimization methods that are most commonly and recently used for optimal material (or molecule) design.", "The paper considers the problem of sampling points from a constrained set in R^d where the constraints can only be accessed in a zero order fashion. They consider the specific situation where the constraints are a solution of a complicated PDE solver and hence the derivatives or specific functional forms of the constraint cannot be obtained. They repose the problem as sampling from a Gibbs distribution whose potential contains constraints as penalties in a Lagrangian fashion. They now wish to sample from the Gibbs distribution using Langevin diffusion.  The Langevin process requires the gradient of the potential. The setting does not allow for that and therefore the authors propose two approaches - \n\n1. Constructing the gradient from zeroth order entries of a Gaussian smoothed potential (much like works of Nesterov et al on zero order optimization). \n2. Using a parametric function class (like an RKHS or a neural network) to learn a function that approximates the gradient of the constraints well given zeroth order constraint evaluations. 
\n\n\nThe theoretical analysis seems quite rigorous as it begins by first providing a basic guarantee for constrained Langevin sampling when gradients are computed with error. The non error gradient part of this analysis has been established before and the authors mention the references appropriately. I have a couple of questions regarding the precise statements of the theorem that I will ask towards the end of the review. Overall it's hard to comment on the tightness of the analysis as the tightness of the bounds is also unclear for the non-error versions. Nevertheless the bounds achieved do not look much worse than the non-error counterparts and are easy to implement. The rest of the bounds focus on achieving low error in approximation of the gradients in various settings. Overall the theory in this part seems very loose in terms of bounds as exponential factors in dimensions start to appear and in that regard seems quite preliminary but it's hard to comment on whether it's natural or can be improved. \n\nSome questions I have regarding the theorem statements \n\nRegarding theorem 1 SPLMC convergence  (and corollaries of the theorem) -  I find it surprising that there is no lower bound assumption on eta in terms of K - only an upper bound. This seems wrong particularly as the theorem as stated then allows eta to be set extremely small while K is finite, in which case there should be no convergence theorems at all. The condition on eta should be a theta(f (K)) for some f type of condition like in the second part of the theorem. I would suggest that the authors look into the theorem - or clarify why this is the case. \n\nI am confused by the presentation of the Shi et al results as there is no penalty appearing for the approximation error due to an RKHS, only a finite sample penalty. Does the result assume that phi belongs to the function class of the RKHS in question? Probably yes and in that case that should be specified. 
\n\n6: Weak Accept"], "input": "1 INTRODUCTION\nIn many real-world design problems, the optimal design needs to simultaneously satisfy multiple constraints, which can be expensive to estimate. For example, in computational material design, the goal is to come up with material configurations, or samples, satisfying a list of physical constraints that are given by black-box numerical Partial Differential Equations (PDE) solvers. Such solvers (for example, the Boltzmann Transport Equation solver) are often complex, expensive to evaluate, and offer no access to their inner variables or their gradients.\nWe pose this design-under-constraints problem as sampling from a Gibbs distribution defined on some compact support. The problem of sampling from a distribution with unknown likelihood that can only be point-wise evaluated is called black-box sampling (Chen & Schmeiser, 1998; Neal, 2003). We show in this paper that constrained black-box sampling can be cast as a constrained Langevin dynamics with gradient-free methods. Zero-order optimization via Gaussian smoothing was introduced in Nesterov & Spokoiny (2017) and extended to black-box sampling with Langevin dynamics in Shen et al. (2019). We extend this approach to the constrained setting from a black-box density with compact support.\nHowever, one shortcoming of this approach is that it is computationally very expensive since it requires repeatedly querying PDE solvers in order to get an estimate of the gradient. To alleviate computational issues, we propose Surrogate Model Based Langevin dynamics, that consists of two steps: (i) Learning (using training data) an approximation of the gradient of the potential of the Gibbs distribution. We show that learning the gradient, rather than the potential itself, is important for the\nmixing of the Langevin dynamics towards the target Gibbs distribution. 
We devise several objective functions, as well as deep neural-network architectures for parameterizing the approximating function class, for learning the gradient of the potential function. (ii) We then use the surrogate gradient model in the constrained Langevin dynamics in lieu of the black-box potential. Using the surrogate enables more efficient sampling, since it avoids querying the expensive PDE solvers, and obtaining gradients is as efficient as evaluating the functions themselves using automatic differentiation frameworks such as PyTorch or TensorFlow.\nTo summarize, our main contributions are as follows:\n1. We cast the problem of generating samples under constraints in the black-box setting as sampling from a Gibbs distribution.\n2. We introduce Constrained Zero-Order Langevin Monte Carlo, using projection or proximal methods, and provide the proof of its convergence to the target Gibbs distribution.\n3. We introduce Surrogate Model Based Projected Langevin Monte Carlo via learning the gradient of the potential of the Gibbs distribution using deep neural networks or reproducing kernel spaces, and prove its convergence to the target distribution when used in conjunction with projection or proximal based methods. We shed light on the importance of the approximation of the gradient of the potential, and we show how to achieve this using Hermite and Taylor learning.\n4. We showcase the usability and effectiveness of the proposed methods for the design of nanoporous configurations with improved thermoelectric efficiency. The design consists of finding new configurations with optimized pore locations, such that the resulting configurations have favorable thermal conductivity (i.e., minimal \u03ba) and desired mechanical stability (von Mises Stress \u03c3 \u2264 \u03c4, where \u03c4 is some preset threshold). 
1 2 FROM CONSTRAINTS SATISFACTION TO SAMPLING FROM A GIBBS DISTRIBUTION: POSTERIOR REGULARIZATION\nIn black-box optimization problems (such as the material design under consideration), the goal is to find a posterior distribution q of samples satisfying a list of equality and inequality constraints: \u03c8j(x) = yk, j = 1 . . . Ce, and \u03c6k(x) \u2264 bk, k = 1 . . . Ci where x \u2208 \u2126 and \u2126 \u2282 Rd is a bounded domain. We assume a prior distribution p0 (whose analytical form is known). The main challenge in black-box optimization is that the functions \u03c8j and \u03c6k can be only evaluated point-wise, and neither do we have functional forms nor access to their gradients. For example, \u03c8 and \u03c6 might be obtained via aggregating some statistics on the solution of a nonlinear PDE given by a complex solver.\nTo make the problem of learning under constraints tractable, we choose Lagrangian parameters \u03bbj > 0 and obtain the following relaxed objective:\nmin q, \u222b \u2126 q(x)=1 KL(q, p0) + Ce\u2211 j=1 \u03bbjEx\u223cq(\u03c8j(x)\u2212 yk)2 + Ci\u2211 k=1 \u03bbkEx\u223cq(\u03c6k(x)\u2212 bk)+ (1)\nThe formulation in Eq. 1 is similar in spirit to the posterior regularization framework of Ganchev et al. (2010); Hu et al. (2018). However, we highlight two differences: (i) our focus is on constrained settings (where \u2126 is bounded), and (ii) we assume a black-box setting. We first obtain: Lemma 1 (Constraint Satisfaction as Sampling from a Gibbs Distribution). The solution to the distribution learning problem given in Eq. 1 is given by:\n\u03c0(x) = exp(\u2212U(x))\nZ 1x\u2208\u2126 (2)\nwhere U(x) = \u2212 log p0(x) + \u2211Ce j=1 \u03bbj(\u03c8j(x) \u2212 yk)2 + \u2211Ci k=1 \u03bbk(\u03c6k(x) \u2212 bk)+ and Z =\u222b\nx\u2208\u2126 exp (\u2212U(x)) dx.\nLemma 1 shows that the constraint satisfaction problem formulated in Eq. 1 amounts to sampling from a Gibbs distribution defined on a compact support given in Eq. 2. 
Sampling from a Gibbs\n1Note that both properties \u03ba and \u03c3 for a given configuration are obtained by numerically solving highly non-linear PDEs. The material configuration is defined by the pore locations, the material used, and the response of the material to heat (thermal) or stress (mechanical) flows.\ndistribution (also known as Boltzmann distribution) has a long history using Langevin dynamics. In the white-box setting when the functions defining the constraints have explicit analytical forms as well as their gradients, Langevin dynamics for Gibbs distribution sampling defined on a compact domain \u2126 and their mixing properties were actively studied in Bubeck et al. (2015); Brosse et al. (2017). In the next Section, we provide a more detailed review.\nRemark 1 (Relation to Bayesian Optimization). While in Bayesian optimization we are interested in finding a point that satisfies the constraints, in our setting we are interested in finding a distribution of candidate samples that satisfy (black-box) constraints. See (Suzuki et al., 2019) for more details.\nRemark 2. For the rest of the paper, we will assume p0 to be the uniform distribution on \u2126, which means that its gradients are zero on the support of the domain \u2126. Otherwise, if p0 is known and belongs to, for instance, an exponential family or a generative model prior (such as normalizing flows), we can sample from \u03c0 using a mixture of black-box sampling on the constraints (\u03c8j , \u03c6k) and white-box sampling on log(p0). 3 WHITE-BOX SAMPLING: CONSTRAINED LANGEVIN DYNAMICS\nWe review in this section Langevin dynamics in the unconstrained case (\u2126 = Rd) and the constrained setting (\u2126 \u2282 Rd). Below, \u2016\u00b7\u2016 denotes the Euclidean norm unless otherwise specified. We are interested in sampling from\n\u03c0(x) = 1\nZ exp(\u2212U(x))1x\u2208\u2126, (3)\nPreliminaries. 
We give here assumptions, definitions and a few preliminary known facts that will be useful later. Those assumptions are commonly used in Langevin sampling analysis (Dalalyan, 2017; Bubeck et al., 2015; Brosse et al., 2017; Durmus et al., 2019).\n1. Assumption A: We assume \u2126 is a convex set such that 0 \u2208 \u2126, \u2126 contains a Euclidean ball of radius r, and \u2126 is contained in a Euclidean ball of radius R. (For example, \u2126 might encode box constraints.) The projection onto \u2126, P\u2126(x) is defined as follows: for all x \u2208 \u2126, P\u2126(x)=arg minz\u2208\u2126 \u2016x\u2212 z\u20162. Let R = supx,x\u2032\u2208\u2126 ||x\u2212 x\u2032|| <\u221e.\n2. Assumption B: We assume that U is convex, \u03b2-smooth, and with bounded gradients:\n\u2016\u2207xU(x)\u2212\u2207yU(y)\u2016 \u2264 \u03b2 \u2016x\u2212 y\u2016 , \u2200x, y \u2208 \u2126 (\u03b2-smoothness).\n\u2016\u2207U(x)\u2016 \u2264 L, \u2200x \u2208 \u2126 (Boundedness).\nThe Total Variation (TV) distance between two measures \u00b5, \u03bd is defined as follows: TV (\u00b5, \u03bd) = supA |\u00b5(A)\u2212 \u03bd(A)|. Pinsker\u2019s inequality relates KL divergence to TV: TV (\u00b5, \u03bd) \u2264 \u221a 2KL(\u00b5, \u03bd).\nUnconstrained Langevin Dynamics. In the unconstrained case, the goal is to sample from a Gibbs distribution \u03c0(x) = exp(\u2212U(x))/Z that has unbounded support. This sampling can be done via the Langevin Monte Carlo (LMC) algorithm, which is given by the following iteration:\nXk+1 = Xk \u2212 \u03b7\u2207xU(Xk) + \u221a 2\u03bb\u03b7\u03bek, k = 0 . . .K \u2212 1 (LMC), (4)\nwhere \u03bek \u223c N (0, Id), \u03b7 is the learning rate, and \u03bb > 0 is a variance term. Constrained Langevin Dynamics. In the constrained case, the goal is to sample from \u03c0(x) = exp(\u2212U(x))/Z 1x\u2208\u2126. We discuss two variants: Projected Langevin Dynamics. Similar to projected gradient descent, Bubeck et al. 
(2015) introduced Projected Langevin Monte Carlo (PLMC) and proved its mixing properties towards the stationary distribution \u03c0. PLMC is given by the following iteration:\nXk+1 = P\u2126 ( Xk \u2212 \u03b7\u2207xU(Xk) + \u221a 2\u03bb\u03b7\u03bek ) , k = 0 . . .K \u2212 1 (PLMC), (5)\nIn essence, PLMC consists of a single iteration of LMC, followed by a projection on the set \u2126 using the operator P\u2126.\nProximal Langevin Dynamics. Similar to proximal methods in constrained optimization, Brosse et al. (2017) introduced Proximal LMC (ProxLMC) that uses the iteration:\nXk+1 = ( 1 \u2212 \u03b7/\u03b3 ) Xk \u2212 \u03b7\u2207xU(Xk) + (\u03b7/\u03b3) P\u2126(Xk) + \u221a 2\u03bb\u03b7\u03bek, k = 0 . . .K\u22121, (ProxLMC) (6)\nwhere \u03b7 is the step size and \u03b3 is a regularization parameter. In essence, ProxLMC (Brosse et al., 2017) performs an ordinary LMC on U\u03b3(x) = U(x) + i\u03b3\u2126(x), where i\u03b3\u2126(x) is the proximal operator:\ni\u03b3\u2126(x) = infy { i\u2126(y) + (2\u03b3)\u22121 \u2016x\u2212 y\u20162 } = (2\u03b3)\u22121 \u2016x\u2212 P\u2126(x)\u20162 ,\nwhere i\u2126(x) = 0 for x \u2208 \u2126 and i\u2126(x) = \u221e for x /\u2208 \u2126. Therefore, the update in Eq. 6 is a regular Langevin update (as in Eq. 4) with potential gradient \u2207xU\u03b3(x) = \u2207xU(x) + \u03b3\u22121(x\u2212 P\u2126(x)). We denote by \u00b5PLMCK and \u00b5ProxLMCK the distributions of XK obtained by iterating Eq. 5 and Eq. 6 respectively. Under Assumptions A and B, both these distributions converge to the target Gibbs distribution \u03c0 in the total variation distance. In particular, Bubeck et al. (2015) showed that for \u03b7 = \u0398\u0303(R2/K), we obtain:\nTV (\u00b5PLMCK , \u03c0) \u2264 \u03b5 for K = \u2126\u0303(\u03b5\u221212d12). (7) Likewise, Brosse et al. 
(2017) showed that for 0 < \u03b7 \u2264 \u03b3(1 + \u03b22\u03b32)\u22121, we obtain:\nTV (\u00b5ProxLMCK , \u03c0) \u2264 \u03b5 for K = \u2126\u0303(\u03b5\u22126d5), (8) where the notation \u03b1n = \u2126\u0303(\u03b2n) means that there exists c \u2208 R, C > 0 such that \u03b1n \u2265 C\u03b2n logc(\u03b2n). 4 CONSTRAINED LANGEVIN DYNAMICS IN THE BLACK-BOX SETTING\nWe now introduce our variants of constrained LMC for the black-box setting where explicit potential gradients are unavailable. We explore in this paper two strategies for approximating the gradient of U in the black-box setting. In the first strategy, we borrow ideas from derivative-free optimization (in particular, evolutionary search). In the second strategy we learn a surrogate deep model that approximates the gradient of the potential. Below, let G : \u2126\u2192 Rd be a vector valued function that approximates the gradient of the potential,\u2207xU . We make:\nAssumption C. The surrogate gradient G satisfies E \u2016G(Yk)\u20162 <\u221e,\u2200k. Surrogate Projected Langevin Dynamics. Given Y0, the Surrogate Projected LMC (S-PLMC) replaces the potential gradient\u2207xU in Eq. 5 with the surrogate gradient G:\nYk+1 = P\u2126 ( Yk \u2212 \u03b7G(Yk) + \u221a 2\u03bb\u03b7\u03bek ) , k = 0 . . .K \u2212 1 (S-PLMC) (9)\nSurrogate Proximal Langevin Dynamics. Similarly, the Surrogate Proximal LMC (S-ProxLMC) replaces the unknown potential gradient\u2207xU in Eq. 6 with the gradient surrogate G:\nYk+1 =\n( 1\u2212 \u03b7\n\u03b3\n) Yk \u2212 \u03b7G(Yk) + \u03b7\n\u03b3 P\u2126(Yk) +\n\u221a 2\u03bb\u03b7\u03bek, k = 0 . . .K \u2212 1 (S-ProxLMC) (10)\nWe now present our main theorems on the approximation properties of surrogate LMC (S-PLMC, and S-ProxLMC). We do so by bounding the total variation distance between the trajectories of the surrogate Langevin dynamics (S-PLMC, and S-ProxLMC) and the true LMC dynamics (PLMC and ProxLMC). 
Theorem 1 is an application of techniques in Stochastic Differential Equations (SDE) introduced in Dalalyan & Tsybakov (2012) and is mainly based on a variant of Girsanov\u2019s Theorem for change of measures (Lipster & Shiryaev, 2001) and Pinsker\u2019s Inequality that bounds total variation in terms of Kullback-Leibler divergence. Theorem 1 (S-PLMC and S-ProxLMC Mixing Properties). Under Assumption C, we have:\n1. S-PLMC Convergence. Let \u00b5PLMCK be the distribution of the random variable XK obtained by iterating PLMC Eq. 5, and \u00b5S-PLMCK be the distribution of the random variable YK obtained by iterating S-PLMC given in Eq. 9. We have:\nTV (\u00b5S-PLMCK , \u00b5 PLMC K ) \u2264\n\u221a \u03b7\n\u03bb ( K\u22121\u2211 k=0 E \u2016G(Yk)\u2212\u2207xU(Yk)\u20162 +K\u03b22R2 ) 1 2 . (11)\n2. S-ProxLMC Convergence. Let \u00b5ProxLMCK be the distribution of the random variable XK obtained by iterating ProxLMC Eq. 6, and \u00b5S-ProxLMCK be the distribution of the random variable YK obtained by iterating S-ProxLMC given in Eq. 10. We have:\nTV (\u00b5S-ProxLMCK , \u00b5 ProxLMC K ) \u2264\n\u221a \u03b7\n2\u03bb ( K\u22121\u2211 k=0 E \u2016G(Xk)\u2212\u2207xU(Xk)\u20162 ) 1 2 . (12)\nFrom Theorem 1, we see that it suffices to approximate the potential gradient \u2207xU(X) (and not the potential U(X)) in order to guarantee convergence of surrogate-based Langevin sampling. Using the triangle inequality, and combining Theorem 1 and bounds in Eqs 7 and 8 we obtain: Theorem 2. (Convergence of Surrogate Constrained LMC to the Gibbs distribution.) Under assumptions A, B and C we have: 1. Assume in S-PLMC that there exists \u03b4 > 0 such that E \u2016G(Yk)\u2212\u2207xU(Yk)\u20162 \u2264 \u03b4, \u2200k \u2265 0. Set\n\u03bb = 1, and \u03b7 = \u0398\u0303(min(R2/K,\u03b1/K2)) where \u03b1 = 1/(\u03b4 + \u03b22R2) . Then for K = \u2126\u0303(\u03b5\u221212d12), we have:\nTV (\u00b5S-PLMCK , \u03c0) \u2264 \u03b5.\n2. 
Assume in S-ProxLMC that there exists \u03b4 > 0 such that E \u2016G(Xk)\u2212\u2207xU(Xk)\u20162 \u2264 \u03b4, \u2200k \u2265 0. Set \u03bb = 1, and \u03b7 = min(\u03b3(1 + \u03b22\u03b32)\u22121, 1\u03b4K2 ). Then for K = \u2126\u0303(\u03b5 \u22126d5) we have:\nTV (\u00b5S-ProxLMCK , \u03c0) \u2264 \u03b5. 5 ZERO-ORDER CONSTRAINED LANGEVIN DYNAMICS\nIn zero-order optimization (Nesterov & Spokoiny, 2017; Duchi et al., 2015; Ghadimi & Lan, 2013; Shen et al., 2019), one considers the Gaussian smoothed potential U\u03bd defined as U\u03bd(x) = Eg\u223cN (0,Id)U(x+ \u03bdg), and its gradient is given by\u2207xU\u03bd(x) = Eg U(x+\u03bdg)\u2212U(x) \u03bd g. The following is a Monte Carlo estimate of\u2207xU\u03bd(x): G\u0302nU(x) = 1\nn n\u2211 j=1 ( U(x+ \u03bdgj)\u2212 U(x) \u03bd ) gj , (13)\nwhere g1, . . . gn are i.i.d. standard normal vectors.\nZero-Order sampling from log-concave densities was recently studied in Shen et al. (2019). We extend it here to the constrained sampling case of log-concave densities with compact support. We define Constrained Zero-Order Projected LMC (Z-PLMC) and Zero-Order Proximal LMC (Z-ProxLMC) by setting G(x) = G\u0302nU(x) in Eq. 9 and Eq. 10 respectively. Lemma 2 (Zero-Order Gradient Approximation(Nesterov & Spokoiny, 2017; Shen et al., 2019)). Under Assumption B, we have for all x \u2208 \u2126:\nEg1,...,gn \u2225\u2225\u2225G\u0302nU(x)\u2212\u2207xU(x)\u2225\u2225\u22252 \u2264 (\u03b2\u03bd(d+ 2)3/2 + (d+ 1) 12L)2 /n (14)\nThanks to Lemma 2 that ensures uniform approximation of gradients in expectation, we can apply Theorem 2 and get the following corollary for Z-PLMC and Z-ProxLMC: Corollary 1 (Zero-order Constrained Langevin approximates the Gibbs distribution). Under As-\nsumptions A and B, let \u03b4 \u2208 [0, 1], for n \u2265 ( \u03b2\u03bd(d+ 2)3/2 + (d+ 1) 1 2L )2 /\u03b4, we have the following\nbounds in expectation:\n1. 
Set \u03bb = 1, and \u03b7 = \u0398\u0303(min(R2/K,\u03b1/K2)) where \u03b1 = 1/(\u03b4 + \u03b22R2) . For K = \u2126\u0303(\u03b5\u221212d12), we have:\nEg1,...gnTV (\u00b5Z-PLMCK , \u03c0) \u2264 \u03b5. (15)\n2. Set \u03bb = 1, and \u03b7 = min(\u03b3(1 + \u03b22\u03b32)\u22121, 1\u03b4K2 ). For K = \u2126\u0303(\u03b5 \u22126d5) we have: Eg1,...gnTV (\u00b5Z-ProxLMCK , \u03c0) \u2264 \u03b5. (16) Remark 3. For simplicity, we state the above bound in terms of expectations over the randomness in estimating the gradients. It is possible to get finite-sample bounds using the Vector Bernstein concentration inequality, coupled with covering number estimates of \u2126 but omit them due to space. 6 SURROGATE MODEL BASED CONSTRAINED LANGEVIN DYNAMICS\nDespite its theoretical guarantees, zero-order constrained Langevin (Z-PLMC and Z-ProxLMC) has a prohibitive computation cost as it needs O(nK) black-box queries (in our case, invocations of a nonlinear PDE solver). To alleviate this issue, we introduce in this Section a neural surrogate model as an alternative to the gradient of the true potential. 6.1 HERMITE LEARNING OF GRADIENTS: JACOBIAN MATCHING OF ZERO-ODER ESTIMATES\nFrom Theorem 2, we saw that in order to guarantee the convergence of constrained Langevin dynamics, we need a good estimate of the gradient of the potential of the Gibbs distribution. Recall that the potential given in Lemma 1 depends on \u03c8j and \u03c6k, which are scalar outputs of computationally heavy PDE solvers in our material design problem. To avoid this, we propose to train surrogate neural network models approximating each PDE output and their gradients. Concretely, suppose we are given a training set S for a PDE solver for the property \u03c8 (dropping the index j for simplicity):\nS = {(xi, yi = \u03c8(xi), y\u0303i = G\u0302n\u03c8(xi)), xi \u223c \u03c1\u2126i.i.d., i = 1, . . . , N},\nwhere \u03c1\u2126 is the training distribution and G\u0302n\u03c8(.) 
is the zero-order estimate of the gradient of \u03c8 given in Eq. 13. We propose to learn a surrogate model belonging to a function class H\u03b8, f\u0302\u03b8 \u2208H\u03b8, that regresses the value of \u03c8 and matches the zero-order gradient estimates as follows:\nmin f\u03b8\u2208H\u03b8\n1\nN N\u2211 i=1 {(yi \u2212 f\u03b8(xi))2 + \u2016\u2207xf\u03b8(xi)\u2212 y\u0303i\u20162} (Z-Hermite Learning) (17)\nThe problem in Eq. 17 was introduced and analyzed in Shi et al. (2010) where H\u03b8 is a ball in a Reproducing Kernel Hilbert Space (RKHS). Following Shi et al. (2010), we refer to this type of learning as Hermite Learning. In the deep learning community, this type of learning is called Jacobian matching and was introduced in Srinivas & Fleuret (2018); Czarnecki et al. (2017) where H\u03b8 is a deep neural network parameterized with weights \u03b8. When f\u03b8 is a deep network, we can optimize this objective efficiently using common deep learning frameworks (PyTorch, TensorFlow).\n(Shi et al., 2010) have shown that when H\u03b8 is an RKHS ball and when y\u0303i = \u2207x\u03c8(xi) are exact gradients, for a sufficiently large training set with N = O(1/ 1/(2r\u03b6)) (where r, \u03b6 are exponents in [0, 1] that depend on the regularity of the function \u03c8). Under the assumption that \u03c8 \u2208H\u03b8 we have:\u222b\n\u2126 \u2016\u2207xf\u03b8(x)\u2212\u2207x\u03c8(x)\u20162 \u03c1\u2126(x)dx \u2264 . Since we are using inexact zero-order gradients, we will\nincur an additional numerical error that is also bounded as shown in Lemma 2. 6.2 TAYLOR LEARNING OF GRADIENTS\nWhile Jacobian matching of zero-order gradients is a sound approach, it remains expensive to construct the dataset, as we need for each point to have 2n+ 1 queries of the PDE solver. We exploit in this section the Taylor learning framework of gradients that was introduced in Mukherjee & Zhou (2006); Mukherjee & Wu (2006), and Wu et al. (2010). 
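As a point of reference for the dataset cost discussed above, the zero-order estimate of Eq. 13 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function name zero_order_gradient, the default values of nu and n, and the toy potential in the usage note are hypothetical and are not taken from the paper; in the paper's setting U would be an expensive PDE solver call.

```python
import random


def zero_order_gradient(U, x, nu=1e-2, n=64, rng=None):
    """Monte Carlo estimate of the smoothed gradient, following Eq. 13.

    U   : black-box potential taking a list of floats (hypothetical stand-in
          for a PDE solver output)
    x   : point at which the gradient is estimated (list of floats)
    nu  : smoothing radius (assumed default, not from the paper)
    n   : number of Gaussian directions g_1, ..., g_n
    rng : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random(0)
    d = len(x)
    u0 = U(x)  # base-point value, reused across all n directions
    grad = [0.0] * d
    for _ in range(n):
        # g_j ~ N(0, I_d)
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_pert = [xi + nu * gi for xi, gi in zip(x, g)]
        # finite-difference slope along g_j: (U(x + nu g_j) - U(x)) / nu
        slope = (U(x_pert) - u0) / nu
        for k in range(d):
            grad[k] += slope * g[k] / n
    return grad
```

For a toy linear potential such as U(x) = x[0] - 2 x[1], the estimate concentrates around the true gradient (1, -2) as n grows, which is the behaviour Lemma 2 quantifies through its 1/n rate.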
In a nutshell, Mukherjee & Zhou (2006) suggests to learn a surrogate potential f\u03b8 and gradient G\u039b that are consistent with the first-order taylor expansion. Given a training set S = {(xi, yi = \u03c8(xi)), x \u223c \u03c1\u2126, i = 1 . . . N}, Wu et al. (2010) suggest the following objective:\nmin f\u03b8\u2208H\u03b8,G\u039b\u2208H d\u039b\n1\nN2 \u2211 i,j w\u03c3ij(yi \u2212 f\u03b8(xj) + \u3008G\u039b(xi), xj \u2212 xi\u3009)2(Taylor-2), (18)\nwhere w\u03c3ij = exp ( \u2212\u2016xi\u2212xj\u20162 \u03c32 ) , H\u03b8 is an RKHS ball of scalar valued functions, and H d\u039b is an RKHS ball of vector valued functions. Under mild assumptions, Mukherjee & Zhou (2006) shows that we have for N = O(1/ d/2):\u222b \u2126 \u2016G\u039b(x)\u2212\u2207x\u03c8(x)\u20162 \u03c1\u2126(x)dx \u2264 .We simplify the problem in Eq. 18 and propose the following two objective functions and leverage the deep learning toolkit to parameterize the surrogate f\u03b8:\nmin f\u03b8\u2208H\u03b8\n1\nN2 \u2211 i,j w\u03c3ij(yi \u2212 f\u03b8(xj) + \u3008\u2207xf\u03b8(xi), xj \u2212 xi\u3009)2(Taylor-1), (19)\nmin f\u03b8\u2208H\u03b8\n1\nN N\u2211 i=1 {(yi\u2212f\u03b8(xi))2+ \u03bb N2 \u2211 i,j w\u03c3ij(yi\u2212yj+\u3008\u2207xf\u03b8(xi), xj \u2212 xi\u3009)2}, (Taylor-Reg) . (20)\nThe objective in Eq. 19 uses a single surrogate to parameterize the potential and its gradient. The objective in Eq. 20 is similar in spirit to the Jacobian matching formulation in the sense that it adds a regularizer on the gradient of the surrogate to be consistent with the first-order Taylor expansion in local neighborhoods. The advantage of the Taylor learning approach is that we do not need to perform zero-order estimation of gradients to construct the training set and we rely instead on first-order approximation in local neighborhood. 6.3 SURROGATE MODEL CONSTRAINED LMC\nConsider the surrogate model f\u03b8 obtained via Hermite Learning (Eq. 17) or via Taylor learning (Eqs 18, 19, 20). 
We are now ready to define the surrogate model LMC by setting G(x) = \u2207xf\u03b8(x) in the constrained Langevin dynamics in Eqs 9 and 10.\nBoth Hermite and Taylor learning come with theoretical guarantees when the approximation function space is an RKHS, under some mild assumptions on the training distribution and the regularity of the target function \u03c8. In Hermite learning (Theorem 2 in Shi et al. (2010)) we have: E_{x\u223c\u03c1\u2126} \u2016\u2207xf\u03b8(x) \u2212 \u2207x\u03c8(x)\u2016^2 \u2264 \u03f5 for a sufficiently large training set N = O(1/\u03f5^{1/(2\u03b6r)}) (where the exponents \u03b6, r \u2208 [0, 1] depend on the regularity of \u03c8). In Taylor learning with the objective function given in Eq. 18 (Proposition 7 in Wu et al. (2010)) we have: E_{x\u223c\u03c1\u2126} \u2016G\u039b(x) \u2212 \u2207x\u03c8(x)\u2016^2 \u2264 \u03f5 for N = O(1/\u03f5^{d/2}). In order to apply Theorem 2 we need this gradient approximation error to hold in expectation on all intermediate distributions in the Langevin sampling. Hence, we need the following extra assumption on the training distribution \u03c1\u2126:\nAssumption D: Assume we have a learned surrogate G on training distribution \u03c1\u2126 such that E_{x\u223c\u03c1\u2126} \u2016G(x) \u2212 \u2207xU(x)\u2016^2 \u2264 \u03f5. Assume \u03c1\u2126(x) > 0, \u2200x \u2208 \u2126, and that it is a dominating measure of the Langevin (PLMC, S-PLMC, ProxLMC, S-ProxLMC) intermediate distributions \u00b5k, i.e. there exists C > 0 such that:\n\u00b5k(x) \u2264 C\u03c1\u2126(x), \u2200x \u2208 \u2126, \u2200k = 0, . . . , K \u2212 1.\nUnder Assumption D, it follows immediately that\nE \u2016G(Xk) \u2212 \u2207U(Xk)\u2016^2 = \u222b_\u2126 \u2016G(x) \u2212 \u2207U(x)\u2016^2 (\u00b5k(x)/\u03c1\u2126(x)) \u03c1\u2126(x) dx \u2264 C\u03f5,\nand hence we can apply Theorem 2 with \u03b4 = C\u03f5, and we obtain an \u03b5-approximation of the target Gibbs distribution in terms of total variation distance. Remark 4. 
Assumption D on the \u03f5-approximation of the gradient can be achieved for a large enough training set N, when we use Hermite learning in RKHS under mild assumptions and in Taylor learning. The assumption on the dominance of the training distribution is natural and means that we need a large training set that accounts for what we may encounter in Surrogate LMC iterations.\nIn what follows we refer to surrogate constrained LMC as x-PLMC or x-ProxLMC, where x is one of four prefixes ({Z-Hermite, Taylor-2, Taylor-1, Taylor-Reg}).\n8 EXPERIMENTS\nIn this section, we demonstrate the usability of our black-box Langevin sampling approach for the design of nano-porous configurations. We first show the performance of the surrogate models in learning the potential function, showcasing the results using four different variants: standard regression, Taylor regularization, Taylor-1 and Taylor-2. We then show how well the surrogate-based Langevin MC generates new samples under the thermal and mechanical constraints. We compare the sample quality on multiple criteria between the surrogate and zero-order approaches with either a projection or a proximal update step.\nData. We want to learn surrogate models to approximate the gradient of the potential from data. To this end, we generate a dataset of 50K nano-porous structures, each of size 100nm \u00d7 100nm. One such example is displayed in Fig. 1. The number of pores is fixed to 10 in this study and each pore is a square with a side length of 17.32nm. We sample the pore centers uniformly over the unit square and construct the corresponding structure after re-scaling them appropriately. Then, using the solvers OpenBTE (Romano & Grossman, 2015) and Summit (\u2211MIT Development Group, 2018), we obtain for each structure x a pair of values: thermal conductivity \u03ba and von Mises stress \u03c3. Finally, we collect two datasets: {(x_i, \u03ba_i)}_{i=1}^N and {(x_i, \u03c3_i)}_{i=1}^N with the same inputs x_i and N = 50K samples. 
More details on the PDEs and their corresponding solvers are given in Appendices B and C.\nFeatures. The pore locations are the natural input features to the surrogate models. Apart from the coordinates, we also derive some other features based on physical intuition. For example, the distances between pores and the alignment along axes are informative of thermal conductivity (Romano & Grossman, 2016). As such, we compute pore-pore distances along each coordinate axis and add them as additional features.\nSurrogate gradient methods. We use feed-forward neural networks to model the surrogates since obtaining gradients for such networks is efficient thanks to automatic differentiation frameworks. We use networks composed of 4 hidden layers with sizes 128, 72, 64, 32 and apply the same architecture to approximate the gradients for \u03ba and \u03c3 separately. The hidden layers use ReLU activations whereas a sigmoid is used at the output layer (after the target output is properly normalized). For the Taylor-2 variant (in Eq. 18), we have an additional output vector of the same size as the input for the gradient prediction. The networks are trained on the corresponding objective functions set up earlier with an Adam optimizer with learning rate 10^\u22124 and decay 1.0. We fine-tune the networks with a simple grid search and select the best models for comparison. Due to the space constraint, we present the results in Appendix A and emphasize that Z-Hermite is not included in the full comparison but only in a small experiment performed with a more lightweight OpenBTE version.\nIncorporating constraints and comparison metrics. We demonstrate the usability of our proposed black-box Langevin sampling for the design of nano-configurations under thermal conductivity and mechanical stability constraints that are provided by the corresponding PDE solvers. To compare sampling outcomes, we use the following metrics. 
We report the minimum value of \u03ba and Monte Carlo estimates for both \u03ba and \u03c3 to compare the samples generated by different sampling methods and surrogate models. The Monte Carlo estimates are computed on 20 samples.\nSingle constraint. Our first task is to design nano-configurations under the thermal conductivity constraint where we want \u03ba as low as possible in order to achieve high thermo-electric efficiency. From the posterior regularization formulation in Section 2, we pose the constraint satisfaction as sampling from the following Gibbs distribution:\n\u03c0(x) = (1/Z) p0(x) exp(\u2212\u03bb\u03ba(x)^2) 1{x \u2208 [0,1]^20}, (21)\nwhere p0(x) is the uniform distribution over the unit square, which is equivalent to the Poisson process of 10 pores on the square, and \u03ba(x) is the thermal conductivity we want to minimize. Starting from 20 samples initialized from p0(x), we run our proposed black-box Langevin MCs and obtain 20 new realizations from the target distribution \u03c0(x). We use four different surrogates (simple regression, Taylor-Reg, Taylor-1 and zero-order) and pair each surrogate with either a projection or a proximal update. We show the summary statistics of these samples in Table 1. The regression-PLMC in the first row and regression-ProxLMC in the fifth represent the sampling where the surrogate models are fitted solely on the mean square error objective. In all methods, we set \u03bb = 100, the step size \u03b7 = 1e\u22123 and the exponential decay rate 0.8. Since keeping track of the true \u03ba value is expensive, we stop after K = 10 iterations. We first observe that the regression-based methods (PLMC, ProxLMC) are less effective than the others simply because they do not have an implicit objective for approximating the gradients. Taylor-Reg and Taylor-1 demonstrate their effectiveness in approximating the gradient and are able to achieve lower thermal conductivity. 
In particular, Taylor-1-ProxLMC and Zero-order-PLMC perform in a similar range in terms of the minimum achieved, but the learned surrogate offers a 17x speedup (per sample) over zero-order methods. Due to the space limit, we do not report Taylor-2 results in Table 1, and note that Taylor-2 behaves similarly to Taylor-1.\nMultiple constraints. The minimal thermal conductivity can be achieved without much difficulty (e.g. structures with all pores aligned along the vertical axis), but such structures are often mechanically unstable. In the next step, we study whether adding more (conflicting) constraints helps us design better nano-configurations. Hence, we consider both thermal conductivity \u03ba and mechanical stability provided via von Mises stress \u03c3. We want a sample x that minimizes \u03ba(x) to achieve high thermo-electric efficiency while maintaining \u03c3(x) less than some threshold (which we explain below). As in the single constraint case, we pose this as sampling from the following Gibbs distribution:\n\u03c0(x) = (1/Z) p0(x) exp(\u2212\u03bb1\u03ba(x)^2 \u2212 \u03bb2[\u03c3(x) \u2212 \u03c4]+) 1{x \u2208 [0,1]^20}, (22)\nwhere p0(x) is the same as above, \u03c3(x) is the von Mises stress and \u03c4 is a threshold on the maximum value of \u03c3. With this framework, we relax the inequality constraint to a hinge loss term on the von Mises stress. The results are summarized in Table 2. Note that all the surrogate Langevin MCs are initialized from the same set of 20 samples as above. In this experiment, we set \u03c4 = 0.5, \u03bb1 = 100, \u03bb2 = 10, the step size \u03b7 = 1e\u22123 and the exponential decay rate 0.8. Comparing with Table 1, one can see that not only can a better \u03ba be achieved, but \u03c3 can also be reduced simultaneously. These results suggest that our approach can effectively sample new configurations under multiple competing constraints. Examples of new nano-configurations are shown in Fig. 1 and Appendix A Fig. 
5, 6 and 7.\n9 CONCLUSION\nIn this paper we introduced Surrogate-Based Constrained Langevin Sampling for black-box sampling from a Gibbs distribution defined on a compact support. We studied two approaches for defining the surrogate: the first through zero-order methods and the second via learning gradient approximations using deep neural networks. We showed the proofs of convergence of the two approaches in the log-concave and smooth case. While zero-order Langevin has a prohibitive computational cost, learned surrogate model Langevin enjoys a good tradeoff of lightweight computation and approximation power. We applied our black-box sampling scheme to the problem of nano-material configuration design, where the black-box constraints are given by expensive PDE solvers, and showed the efficiency and the promise of our method in finding optimal configurations. Among different approaches for approximating the gradient, the zero-order ones (PLMC, ProxLMC) show overall superior performance, at a prohibitive computational cost. We established that the deep surrogate (Taylor-1 ProxLMC) is a viable alternative to zero-order methods, achieving reasonable performance, and offering 15x speedup over zero-order methods.\nA SUPPLEMENTAL EXPERIMENTAL RESULTS\nSurrogate gradient methods. We use feed-forward neural networks to model the surrogates since obtaining gradients for such networks is efficient thanks to automatic differentiation frameworks. We use networks composed of 4 hidden layers with sizes 128, 72, 64, 32 and apply the same architecture to approximate the gradients for \u03ba and \u03c3 separately. The hidden layers use ReLU activations whereas a sigmoid is used at the output layer (after the target output is properly normalized). For the Taylor-2 variant (in Eq. 18), we have an output vector for the gradient prediction. The networks are trained on the corresponding objective functions set up earlier with the Adam optimizer with learning rate 10^\u22124 and decay 1.0. 
We fine-tune the networks with a simple grid search and select the best models for comparison.\nAs emphasized throughout, our focus is on approximating the gradient rather than learning the true function. However, we still need to evaluate how well the surrogate models generalize on a hold-out test set. As in canonical regression problems, we compare the surrogate variants against each other using root mean square error (RMSE) on the test set. Figures 2 and 3 show the results. The left figure shows RMSE for predicting \u03ba and the right one shows RMSE for the von Mises stress \u03c3. We can see that Taylor-Reg generalizes better and also converges faster than Taylor-1 and Taylor-2 to the target RMSE for \u03ba, while all methods perform similarly for \u03c3 prediction. This is reasonable because the objectives of Taylor-1 and Taylor-2 do not optimize the mean square error, which is the metric we evaluate here. Figure 3 shows the learning in terms of sample complexity. Again, Taylor-Reg outperforms Taylor-1 and Taylor-2 for \u03ba prediction. In contrast, most models work similarly for \u03c3 regression, particularly when the training size is reduced to 50% (25K).\nEffectiveness of Z-Hermite learning. Notice that Z-Hermite learning is not included in this comparison, nor as a surrogate model in the black-box Langevin sampling in Section 8. The reason is that apart from the usual sample pair (xi, yi), we need the gradient y\u0303i (see Eq. 17). Since we can only query the solvers, this gradient can only be estimated using finite differences. For both \u03ba and \u03c3 in our experiment, obtaining such data is extremely expensive. As a consequence, we do not have the full results of the Z-Hermite model. Instead, we ran a separate study to show the effectiveness of Z-Hermite surrogate LMC on a smaller dataset with a lightweight OpenBTE version (0.9.55). The results in Table 3 show how Z-Hermite learning performs in learning the gradient of \u03ba(x). 
Here, the entropy is a nearest-neighbor estimate that measures the diversity of the pore centers in the unit square. With the (xp, yp)-coordinates of each pore p, the entropy estimate is given by:\nH = (1/n) \u2211_{i=1}^{n} log(n min_{j \u2260 i} \u2016p_i \u2212 p_j\u2016) + log 2 + C.\nA hybrid algorithm between zero-order and Taylor-1 surrogate. We can see in Tables 1, 2 and 3 the trade-off between computation and accuracy of our approach. While zero-order PLMC and ProxLMC can achieve the lowest thermal conductivity, their computational costs are prohibitive. In contrast, deep surrogate models (including Taylor-Reg, Taylor-1) are far more time-efficient but slightly worse in terms of achieving the optimal \u03ba. To mitigate the trade-off, we propose a simple hybrid method that combines the best of the zero-order and Taylor-1 surrogate models. The method, shown in Algorithm 1, alternates between using the gradient from the zero-order estimate and the gradient of the deep surrogate depending on whether taking the surrogate step would decrease the potential function (i.e. \u03ba). We show and compare the achieved \u03ba and running time in Table 3. Examples of the samples generated by Zero-order PLMC, Taylor-1 PLMC and the hybrid method are also depicted in Figure 4. The hybrid achieves a thermal conductivity that is lower than Taylor-1 PLMC while running almost 2x faster than zero-order PLMC. This suggests that the hybrid strategy offers a better trade-off between accuracy and computation. One way to further improve the hybrid is to collect the zero-order gradients while mixing and re-update the surrogate with Z-Hermite learning.\nAlgorithm 1 A hybrid PLMC algorithm alternating between zero-order and Taylor-1 surrogate gradients.\nTrain a network f\u03b8(x) with Taylor-1\nRandomly sample x0 from the uniform p(x)\nPerform a Langevin dynamic step for t = 1, 2, . . . 
,K do\nif \u03ba(x\u2212 \u03b7\u2207xf\u03b8(x)) < \u03ba(x) then x\u2190 P\u2126(x\u2212 \u03b7\u2207xf\u03b8(x) + \u221a 2\u03b7\u03be) else estimate \u2207\u0303\u03ba(x) using zero-order method update x\u2190 P\u2126(x\u2212 \u03b7\u2207\u0303\u03ba(x) + \u221a 2\u03b7\u03be)\nend if end for Return a new sample x\nAdditional generated samples We show additional configurations generated by our sampling approach (Taylor-Reg ProxLMC, Taylor-1 ProxLMC and Zero-order ProxLMC) in Fig. 5, 6 and 7. B BACKGROUND ON MODELING NANOSCALE HEAT TRANSPORT\nAt the nanoscale, heat transport may exhibit strong ballistic behaviour and a non-diffusive model must be used (Chen, 2005). In this work we use the Boltzmann transport equation under the relaxation time approximation and in the mean-free-path (MFP) formulation (Romano & Grossman, 2015)\n\u039bs\u0302 \u00b7 \u2207T (\u039b) + T (\u039b) = \u222b \u03b1(\u039b\u2032)\u3008T (\u039b\u2032)\u3009d\u039b\u2032, (23)\nwhere T (\u039b) is the effective temperature associated to phonons with MFP \u039b and direction s\u0302; the notation \u3008.\u3009 stands for an angular average. The coefficients \u03b1(\u039b\u2032) are given by\n\u03b1(\u039b\u2032) = K(\u039b\u2032)\n\u039b\u2032\n[\u222b K(\u039b\u2032\u2032)\n\u039b\u2032\u2032 d\u039b\u2032\u2032\n]\u22121 , (24)\nwhere K(\u039b\u2032) is the bulk MFP distribution. In general, such a quantity can span several orders of magnitude; however, for simplicity we assume the gray model, i.e. all phonons travel with the same MFP, \u039b0. Within this approximation, we have K(\u039b) = \u03babulk\u03b4(\u039b\u2212 \u039b0). In this work we choose \u039b0 = 10 nm, namely as large as the unit cell, so that significant phonons size effects occur. With no loss of generality, we set \u03babulk = 1 Wm\u22121K\u22121 . Eq. 23 is an integro-differential PDE, which is solved iteratively for each phonon direction over an unstructured mesh (Romano & Di Carlo, 2011). 
We apply periodic boundary conditions along the unit cell while imposing a temperature difference of \u2206T = 1 K along the x-axis. At the pores\u2019 walls we apply diffusive boundary conditions. Upon convergence, the effective thermal conductivity is computed using Fourier\u2019s law, i.e.\n\u03baeff = \u2212(L/(\u2206T A)) \u222b_A J \u00b7 n\u0302 dS, (25)\nwhere J = (\u03babulk/\u039b0)\u3008T(\u039b0) s\u0302\u3009n\u0302 is the heat flux, L is the size of the unit cell, and A is the area of the cold contact (with normal n\u0302). Throughout the text we use the quantity \u03ba = \u03baeff/\u03babulk as a measure of phonon size effects.\nC BACKGROUND ON MODELING MECHANICAL STRESS\nWe model mechanical stress by using the continuum linear elasticity equations\n\u2202\u03c3ij/\u2202xj = fi, (26)\nwhere fi is the body force (which is zero in this case), and \u03c3ij is the stress tensor. Note that we used the Einstein notation, i.e. repeated indices are summed over. The strain \u03f5kl is related to the stress via the fourth-rank elastic constant tensor Cijkl\n\u03c3ij = Cijkl \u03f5kl. (27)\nThe strain is then related to the displacement u via\n\u03f5kl = (1/2)(\u2202uk/\u2202xl + \u2202ul/\u2202xk). (28)\nWe apply periodic boundary conditions along the unit cell, and the applied solicitation is a small in-plane expansion. Once the stress tensor is calculated, we compute the von Mises stress as\n\u03c3VM = \u221a( (1/2)[(\u03c33 \u2212 \u03c32)^2 + (\u03c33 \u2212 \u03c31)^2 + (\u03c32 \u2212 \u03c31)^2] ), (29)\nwhere \u03c3i are the principal stresses. As a mechanical stability estimator we use \u03c3 = max_{x\u2208D}(\u03c3VM), where D is the simulation domain. To avoid plasticity, \u03c3 needs to be smaller than the yield stress of the given material. For the mechanical simulation we used the Summit code (\u2211MIT Development Group, 2018). 
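The stability estimator described above can be illustrated with a short sketch. It assumes that Eq. 29 is read in its standard form, with the factor 1/2 applied to the whole sum of squared principal-stress differences, and that the principal stresses have already been extracted from the stress tensor at a finite set of points of the simulation domain D. The function names and the sample values in the usage note are hypothetical and do not come from the Summit code.

```python
import math


def von_mises(s1, s2, s3):
    """Von Mises stress from the three principal stresses (Eq. 29, assumed standard form)."""
    return math.sqrt(0.5 * ((s3 - s2) ** 2 + (s3 - s1) ** 2 + (s2 - s1) ** 2))


def stability_estimator(principal_stresses):
    """Maximum von Mises stress over sampled domain points.

    principal_stresses: iterable of (s1, s2, s3) tuples, one per point of a
    discretized domain D (the discretization is an assumption of this sketch).
    """
    return max(von_mises(s1, s2, s3) for (s1, s2, s3) in principal_stresses)
```

For instance, a uniaxial state (s, 0, 0) gives a von Mises stress equal to s, and a purely hydrostatic state (s, s, s) gives zero, so stability_estimator([(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 2.0, 2.0)]) evaluates to 3.0. The resulting value is what would be compared against the threshold in the hinge term of Eq. 22.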
D BACKGROUND ON STOCHASTIC DIFFERENTIAL EQUATIONS (SDE): CHANGE OF MEASURE AND GRISANOV\u2019S FORMULA\nTheorem 3 (Grisanov Theorem, Change of Measure for Brownian Motion (Lipster & Shiryaev, 2001), Theorem 6.3 page 257). Let (Wt,Ft) be a Wiener process (Brownian motion) and (\u03b2t,Ft) a random process such that for any T > 0\u222b T\n0\n\u2016\u03b2t\u20162 dt <\u221e a.s\nThen the random process : dW\u0303t = dWt \u2212 \u03b2tdt or written equivalently: W\u0303t = Wt \u2212 \u222b t\n0 \u03b2sds, is a\nWiener process with respect to Ft, t \u2208 [0, T ]. Let PWT = L (W[0,T ]), and P W\u0303T = L (W\u0303[0,T ]) the densities are given by: dP W\u0303 T\ndPWT = exp\n(\u222b T 0 \u3008\u03b2s, dWs\u3009 \u2212 12 \u222b T 0 \u2016\u03b2s\u20162 ds ) . It follows that:\nKL(PWT , P W\u0303 T ) =\n1 2 EPWT [\u222b T 0 \u2016\u03b2s\u20162 ds ] (30)\nTheorem 4 (Grisanov Theorem, Change of Measure for Diffusion Processes, (Lipster & Shiryaev, 2001), ()). Let (Xt)t\u22650 and (Yt)t\u22650\ndXt = \u03b1t(X)dt+ dWt\ndYt = \u03b2t(Y )dt+ dWt\nwhere X0 = Y0 is an F0 measurable random variable. Suppose that the non-anticipative functionals \u03b1t(x) and \u03b2t(x) are such that a unique continuous strong solutions exits for both processes. If for any T > 0:\u222b T\n0\n\u2016\u03b1s(X)\u20162 + \u2016\u03b2s(X)\u20162 ds <\u221e(a.s) and \u222b T\n0\n\u2016\u03b1s(Y )\u20162 + \u2016\u03b2s(Y )\u20162 ds <\u221e(a.s).\nLet PXT = L (X[0,T ]), and P Y T = L (Y[0,T ]).\ndPYT dPXT (X) = exp\n( \u2212 \u222b T\n0\n\u3008\u03b1s(X)\u2212 \u03b2s(X), dXs\u3009+ 1\n2 \u222b T 0 (\u2016\u03b1s(X)\u2016 \u2212 \u2016\u03b2s(X)\u20162)ds ) .\nKL(PXT , P Y T ) =\n1 2 EPXT [\u222b T 0 \u2016\u03b1s(X)\u2212 \u03b2s(X)\u20162 ds ] . 
(31) E BACKGROUND ON ZERO-ORDER OPTIMIZATION (GRADIENT-FREE)\nConsider the smoothed potential U\u03bd defined as follows:\nU\u03bd(x) = Eg\u223cN (0,Id)U(x+ \u03bdg)\nits gradient is given by:\n\u2207xU\u03bd(x) = Eg U(x+ \u03bdg)\u2212 U(x)\n\u03bd g,\nA monte carlo estimate of\u2207xU\u03bd(x) is:\nG\u0302n(x) = 1\nn n\u2211 j=1 ( U(x+ \u03bdgj)\u2212 U(x) \u03bd ) gj ,\nwhere g1, . . . gn are iid standard Gaussians vectors.\nUsing known results in zero order optimization under assumptions on smoothness and bounded gradients of the gradients we have for all x ((Nesterov & Spokoiny, 2017; Shen et al., 2019)):\nEg \u2225\u2225\u2225G\u03021(x)\u2212\u2207xU(x)\u2225\u2225\u22252 \u2264 (\u03b2\u03bd(d+ 2)3/2 + (d+ 1) 12 \u2016\u2207xU(x)\u2016)2 \u2264 (\u03b2\u03bd(d+ 2)3/2 + (d+ 1) 12L)2\nFinally by independence of u1, . . . un we have:\nEg1,...,gn \u2225\u2225\u2225G\u0302n(x)\u2212\u2207xU(x)\u2225\u2225\u22252 \u2264\n( \u03b2\u03bd(d+ 2)3/2 + (d+ 1) 1 2L )2\nn (32) F PROOFS\nProof of Lemma 1. Define the Lagrangian:\nL(q, \u03b7) = \u222b \u2126 log ( q(x) p0(x) ) q(x)dx+ Ce\u2211 j=1 \u03bbj \u222b \u2126 (\u03c8j(x)\u2212 yj)2q(x)dx\n+ Ci\u2211 k=1 \u03bbk \u222b x\u2208\u2126 (\u03c6k(x)\u2212 bk)+q(x)dx+ \u03b7 ( 1\u2212 \u222b x\u2208\u2126 q(x) ) Setting first order optimality conditions on q, we have for x \u2208 \u2126:\nlog\n( q(x)\np0(x)\n) + 1 + C\u2211 j=1 \u03bbj(\u03c8j(x)\u2212 yj)2 + Ci\u2211 k=1 \u03bbk(\u03c6k(x)\u2212 bk)+ \u2212 \u03b7 = 0\nHence we have:\nq(x) = p0(x) exp\n( \u2212 \u2211Ce j=1 \u03bbj(\u03c8j(x)\u2212 yj)2 \u2212 \u2211Ci k=1 \u03bbk(\u03c6k(x)\u2212 bk)+ ) e exp\u2212\u03b7 , x \u2208 \u2126\nand q(x) = 0, x /\u2208 \u2126,\nFirst order optimality on \u03b7 give us: \u222b\n\u2126 q(x) = 1, we conclude by setting e exp(\u2212\u03b7) = Z.\nProof of Theorem 1 1) Projected Langevin. 
Let us define the following continuous processes by interpolation of Xk and YK (Piecewise constant):\ndX\u0303t = P\u2126(U\u0303t(X\u0303)dt+ \u221a 2\u03bbdWt)\nwhere U\u0303t(X\u0303) = \u2212 \u2211\u221e k=0\u2207xU(X\u0303k\u03b7)1t\u2208[k\u03b7,(k+1)\u03b7](t). Similarly let us define :\ndY\u0303t = P\u2126(Gt(Y\u0303 )dt+ \u221a 2\u03bbdWt)\nwhere Gt(Y\u0303 ) = \u2212 \u2211\u221e k=0G(Y\u0303k\u03b7)1t\u2208[k\u03b7,(k+1)\u03b7](t).\nIt is easy to see that we have : Xk = X\u0303k\u03b7 and Yk = Y\u0303k\u03b7 .\nLet \u03c0T X\u0303 and \u03c0T Y\u0303 be the distributions of (X\u0303t)t\u2208[0,T ] and (Y\u0303 )t\u2208[0,T ].\nNote that :\ndY\u0303t = P\u2126 ( U\u0303t(X\u0303t)dt+ \u221a 2\u03bb(dWt +\n1\u221a 2\u03bb\n(Gt(Y\u0303t)\u2212 U\u0303t(X\u0303t))dt) )\nLet dW\u0303t = dWt +\n1\u221a 2\u03bb (Gt(Y\u0303t)\u2212 U\u0303t(X\u0303t))dt\nHence we have : dY\u0303t = P\u2126 ( U\u0303t(X\u0303) + \u221a 2\u03bbdW\u0303t ) ,\nAssume that X0 = Y0 there exists Q such that , XT = Q({Wt}t\u2208[0,T ]) and YT = Q((W\u0303t)t\u2208[0,T ]). Let \u00b5X\u0303T be the law of X\u0303t\u2208[0,T ]. Same for \u00b5 Y\u0303 T . The proof here is similar to the proof of Lemma 8 in (Bubeck et al., 2015). 
By the data processing inequality we have:\nKL(\u00b5X\u0303T , \u00b5 Y\u0303 T ) \u2264 KL(Wt\u2208[0,T ], W\u0303t\u2208[0,T ]),\nNow using Grisanov\u2019s Theorem for change of measure of Brownian Motion (Theorem 3) we have:\nKL(Wt\u2208[0,T ], W\u0303t\u2208[0,T ]) = 1 4\u03bb E \u222b T\n0\n|Gt(Y\u0303t)\u2212 U\u0303t(X\u0303t)|2dt\nConsider T = K\u03b7, hence we have (with some abuse of notation we drop tilde as Yk = Y\u0303k\u03b7):\nKL(\u00b5X\u0303T , \u00b5 Y\u0303 T ) \u2264\n1 4\u03bb E \u222b K\u03b7\n0\n|Gt(Y\u0303t)\u2212 U\u0303t(X\u0303t)|2dt\n= 1 4\u03bb E K\u22121\u2211 k=0 \u222b (k+1)\u03b7 k\u03b7 \u2016G(Yk\u03b7)\u2212\u2207xU(Xk\u03b7)\u20162 dt\n= \u03b7\n4\u03bb K\u22121\u2211 k=0 E \u2016G(Yk\u03b7)\u2212\u2207xU(Xk\u03b7)\u20162\n= \u03b7\n4\u03bb K\u22121\u2211 k=0 E \u2016G(Yk\u03b7)\u2212\u2207xU(Yk\u03b7) +\u2207xU(Yk\u03b7)\u2212\u2207xU(Xk\u03b7)\u20162\n\u2264 \u03b7 2\u03bb K\u22121\u2211 k=0 ( E \u2016G(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162 + E \u2016\u2207xU(Yk\u03b7)\u2212\u2207xU(Xk\u03b7)\u20162 )\nwhere in the last inequality we used the fact that ||a\u2212 b||2 \u2264 2(||a||2 + ||b||2). 
Note that we have by smoothness assumption on U :\n\u2016\u2207xU(Yk\u03b7)\u2212\u2207xU(Xkh)\u20162 \u2264 \u03b22 \u2016Xkh \u2212 Ykh\u20162\nLet R be the diameter of \u2126, we can get a bound as follows:\nKL(\u00b5X\u0303T , \u00b5 Y\u0303 T ) \u2264\n\u03b7\n2\u03bb  K\u22121\u2211 k=0\nE \u2016G(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162\ufe38 \ufe37\ufe37 \ufe38 Gradient approximation error\n+\u03b22 K\u22121\u2211 k=0 E \u2016Xkh \u2212 Ykh\u20162  \u2264 \u03b7\n2\u03bb ( K\u22121\u2211 k=0 E \u2016G(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162 +K\u03b22R2 )\nNow using Pinsker inequality we have:\nTV (\u00b5X\u0303T , \u00b5 Y\u0303 T ) 2 \u2264 2KL(\u00b5X\u0303T , \u00b5Y\u0303T ) \u2264 \u03b7\n\u03bb ( K\u22121\u2211 k=0 E \u2016G(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162 +K\u03b22R2 )\nHence for T = K\u03b7 we have:\nTV (\u00b5S-PLMCK , \u00b5 PLMC K ) \u2264\n\u221a \u03b7\n\u03bb ( K\u22121\u2211 k=0 E \u2016G(Yk)\u2212\u2207xU(Yk)\u20162 +K\u03b22R2 ) 1 2 . (33)\nProof of Theorem 1 2) Proximal LMC. Let us define the following continuous processes by interpolation of Xk and YK (Piecewise constant):\ndX\u0303t = U\u0303t(X\u0303)dt+ \u221a 2\u03bbdWt\nwhere U\u0303t(X\u0303) = \u2212 \u2211\u221e k=0(\u2207xU(X\u0303k\u03b7) + 1 \u03b3 (X\u0303k\u03b7 \u2212 P\u2126(X\u0303k\u03b7)))1t\u2208[k\u03b7,(k+1)\u03b7](t). Similarly let us define : dY\u0303t = Gt(Y\u0303 )dt+ \u221a 2\u03bbdWt\nwhereGt(Y\u0303 ) = \u2212 \u2211\u221e k=0(G(Y\u0303k\u03b7)+ 1 \u03b3 (Y\u0303k\u03b7\u2212P\u2126(Y\u0303k\u03b7)))1t\u2208[k\u03b7,(k+1)\u03b7](t). 
Now applying Grisanov\u2019s Theorem for diffusions (Theorem 4) we have:\nKL(\u00b5X\u0303T , \u00b5 Y\u0303 T ) =\n1\n4\u03bb EPXT [\u222b T 0 \u2225\u2225\u2225Ut(X\u0303)\u2212Gt(X\u0303)\u2225\u2225\u22252 dt]\n= 1 4\u03bb E K\u22121\u2211 k=0 \u222b (k+1)\u03b7 k\u03b7 \u2225\u2225\u2225G(X\u0303k\u03b7)\u2212\u2207xU(X\u0303k\u03b7)\u2225\u2225\u22252 dt = \u03b7\n4\u03bb K\u22121\u2211 k=0 E \u2225\u2225\u2225G(X\u0303k\u03b7)\u2212\u2207xU(X\u0303k\u03b7)\u2225\u2225\u22252\n= \u03b7\n4\u03bb K\u22121\u2211 k=0 E \u2016G(Xk)\u2212\u2207xU(Xk)\u20162 .\nNow using Pinsker inequality we have:\nTV (\u00b5T X\u0303 , \u00b5T Y\u0303 )2 \u2264 2KL(\u00b5T X\u0303 , \u00b5T Y\u0303 ).\nHence for T = K\u03b7 we have:\nTV (\u00b5S-ProxLMCK , \u00b5 ProxLMC K ) \u2264\n\u221a \u03b7\n2\u03bb ( K\u22121\u2211 k=0 E \u2016G(Xk)\u2212\u2207xU(Xk)\u20162 ) 1 2 . (34)\nProof of Theorem 2 . S-PLMC. If we set \u03bb = 1, \u03b7 \u2264 \u03b1/K2, where \u03b1 = 1/(\u03b4 + \u03b22R2), in this Corollary we obtain that : TV (\u00b5S\u2212PLMCK , \u00b5 PLMC K ) \u2264 1\u221aK . Assuming A, B and C we consider \u03b7 \u2264 min(R2/K,\u03b1/K2), and K = \u2126\u0303(\u03b5\u221212d12). Now using the triangle inequality together with the bounds in Eq.s 7 we have: TV (\u00b5S\u2212PLMCK , \u03c0) \u2264 TV (\u00b5 S\u2212PLMC K , \u00b5 PLMC K ) + TV (\u00b5\nPLMC , \u03c0) \u2264 \u03b5+ 1\u221a\nK .\nS-ProxLMC. We conclude with a similar argument for TV (\u00b5S\u2212ProxLMCK , \u03c0) using Eq.s 8. Considering \u03b7 = min(\u03b3(1 + \u03b22\u03b32)\u22121, 1\u03b4K2 ), and K = \u2126\u0303(\u03b5 \u22126d5), we obtain (\u03b5+ 1\u221a K\n) approximation in TV of the target Gibbs distribution.\nProof of Corollary 1. 
Z-PLMC: We have:\nTV (\u00b5X\u0303T , \u00b5 Y\u0303 T ) \u2264 \u221a\u221a\u221a\u221a\u03b7 \u03bb ( K\u22121\u2211 k=0 E \u2016GnU(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162 +K\u03b22R2 )\nTaking the expectation we have:\nEg1...gnTV (\u00b5X\u0303T , \u00b5Y\u0303T ) \u2264 Eg1...gn \u221a\u221a\u221a\u221a\u03b7 \u03bb ( K\u22121\u2211 k=0 E \u2016GnU(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u20162 +K\u03b22R2 )\n\u2264 \u221a\u221a\u221a\u221a\u03b7 \u03bb ( K\u22121\u2211 k=0 EY Eg1...gn \u2016GnU(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u2016 2 +K\u03b22R2 ) (Jensen inequality)\nNote now that we have: Eg1...gn \u2016GnU(Yk\u03b7)\u2212\u2207xU(Yk\u03b7)\u2016 2 \u2264 \u03b4, \u2200Yk\u03b7.\nFor n \u2265 ( \u03b2\u03bd(d+ 2)3/2 + (d+ 1) 1 2L )2 /\u03b4 The rest of the proof is an application of Theorem 2.\nZ-ProxLMC. A similar argument holds.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Streaming variational bayes , Authors: T. Broderick N. Boyd... , 2013\n2.  Title: Online variational Bayesian learning , Authors: Z. Ghahramani , 2000\n3.  Title: Overcoming catastrophic forgetting in neural networks , Authors: J. Kirkpatrick R. Pascanu... , 2017\n4.  Title: A practical Bayesian framework for backpropagation networks , Authors: D.J.C. MacKay , 1992\n5.  Title: Stochastic Models, Estimation and Control. Mathematics in science and engineering , Authors: P.S. Maybeck , 1982\n6.  Title: Virtual Vector Machine for Bayesian online classification , Authors: T.P. Minka R. Xiang... , 2009\n7.  Title: Variational continual learning , Authors: C.V. Nguyen Y. Li... , 2018\n8.  Title: A Bayesian approach to on-line learning , Authors: M. Opper , 1998\n9.  Title: Connectionist models of recognition memory: constraints imposed by learning and forgetting functions , Authors: R. Ratcliff , 1990\n10.  Title: Online structured laplace approximations for overcoming catastrophic forgetting , Authors: H. Ritter A. Botev... , 2018\n11.  Title: Catastrophic forgetting, rehearsal and pseudorehearsal , Authors: A. Robins , 1995\n12.  Title: Online model selection based on the variational bayes , Authors: M. Sato , 2001\n13.  Title: Continual learning with deep generative replay , Authors: H. Shin J.K. Lee... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["This work addresses continual learning for non-stationary data, using Bayesian neural networks and memory-based online variational Bayes. We represent the posterior approximation of the network weights by a diagonal Gaussian distribution and a complementary memory of raw data. 
This raw data corresponds to likelihood terms that cannot be well approximated by the Gaussian. We introduce a novel method for sequentially updating both components of the posterior approximation. Furthermore, we propose Bayesian forgetting and a Gaussian diffusion process for adapting to non-stationary data. The experimental results show that our update method improves on existing approaches for streaming data. Additionally, the adaptation methods lead to better predictive performance for non-stationary data.", "There are many Bayesian approaches to online learning, which differ mostly in the approximation of the posterior distribution at each time-step. Sequential Monte Carlo (Liu & Chen, 1998) approximates the posterior by a set of particles. Assumed Density Filtering (ADF) (Maybeck, 1982) and Bayesian online learning (Opper, 1998) are deterministic posterior approximations based on moment matching. Other deterministic approaches are based on Laplace\u2019s approximation (MacKay, 1992): Kirkpatrick et al. (2017) use multiple diagonal Gaussian posterior approximations of previous time-steps to regularise future tasks; Ritter et al. (2018) use a single (block-diagonal) posterior approximation, summarising all previous time-steps. The latter method is closer to Bayesian online inference, as it is an approximation of Eq. (1). Our work is based on online VB (Opper, 1998; Ghahramani, 2000; Sato, 2001; Broderick et al., 2013), which approximates the posterior at every time-step by minimising the KL-divergence between a parametric (here Gaussian) and the true posterior distribution. In contrast to online VB, we approximate the posterior by a Gaussian distribution and a running memory. Other approaches are based on various types of episodic memory, motivated by their empirical success in preventing catastrophic forgetting. 
The basic idea of rehearsal (Ratcliff, 1990) is to train on both the new data and a subset of previous data or pseudo samples (Robins, 1995; Shin et al., 2017; Kemker & Kanan, 2017) sampled from a generative model. The memory-based online inference methods most similar to our approach are VCL (Nguyen et al., 2018) and VVM (Minka et al., 2009). Both methods use a Gaussian distribution and a running memory to approximate the posterior. VCL uses heuristics such as random selection or the k-center method to update the memory.", "1. Summary:\nThis paper considers neural network training with non-stationary streaming data. To address online inference, the paper uses variational online updates and a running memory (coreset) summarising the data points seen so far, as recently used in the variational continual learning approach. First, the paper identifies a gap in this recent approach: the coreset point selection process does not take into account the approximation quality. The paper develops a sound procedure to select the memory points, essentially to identify \u201cdifficult\u201d data points under the current posterior and put them into the running memory. Second, to handle \u201cconcept drift\u201d, the paper considers imposing some decay on the likelihood of past data and derives post-hoc variational posterior updates for this case. The two contributions are validated on some streaming regression and classification tasks.\n\n2. Opinion and rationales\n\nI\u2019m leaning towards \u201caccept\u201d for this paper since it presents two interesting contributions (albeit of incremental novelty) to the approximate inference area, it has clear execution and super clean presentation, and the experiments clearly demonstrate the value of the proposed approaches.\n\nI would like the paper to clarify the following:\n\na. 
I find the objective for selecting memory points interesting, but am wondering if using individual predictability terms by themselves is good enough to select these points. On reflection, memory is perhaps not the right word to characterise these points. \n\nIt seems to me (i) these terms indicate how well the current posterior predicts the data points and thus this objective will tend to favour points with low predictive likelihood to be selected. That is, these points are important when *all other points* are presented, (ii) there will be *no diversity* in the memory as illustrated in the 2d classification example.\n\nThis means the points selected here have quite different characteristics compared to coreset points or the full dataset in general. Coreset points ideally can compactly represent the full dataset and can be used for inference in place of the full dataset.\n\nThat said, the contribution presented here is very useful. I\u2019m just not sure how well this will work in a more challenging continual learning set-up where diversity is important for a long sequence of diverse tasks.\n\nb. The paper presents a post-hoc modification to the approximate posterior\u2019s mean and variance to account for the decay in the likelihood contribution from past data. I wonder if this post-hoc adjustment is identical to running approximate inference with the adjusted likelihood?\n\nc. The adaptation parameters need to be tuned/known in advance, which seems to be a disadvantage of the current approach. Is the update presented here somewhat robust to mis-specification of these parameters? 
How would these be tuned in a more practical setting in which data arrive sequentially and we might not know the underlying \u201cconcept\u201d.\n\n6: Weak Accept"], "input": "1 INTRODUCTION\nContinual learning (CL), also referred to as lifelong learning, is typically described informally by the following set of desiderata for computational systems: the system should (i) learn incrementally from a data stream, (ii) exhibit information transfer forward and backward in time, (iii) avoid catastrophic forgetting of previous data, and (iv) adapt to changes in the data distribution (Ring, 1997; Silver et al., 2013; Chen & Liu, 2016; Ruvolo & Eaton, 2013; Parisi et al., 2018). The necessity to adapt to non-stationary data is often not reconcilable with the goal of preventing forgetting. This problem is also known as the stability-plasticity dilemma (Grossberg, 1987).\nThe majority of current CL research is conducted in the context of online multi-task learning (Nguyen et al., 2018; Kirkpatrick et al., 2017; Schwarz et al., 2018; Rusu et al., 2016; Fernando et al., 2017), where the main objective is to prevent catastrophic forgetting of previously learned tasks. This focus is reasonable since changes in the statistics of the data distribution are usually an artefact of learning different tasks sequentially. However, changes in the statistics of the data can also be real properties of the data-generating process. Examples include models of energy demand, climate analysis, financial market, or user-behavior analytics (Ditzler et al., 2015). In such applications, the statistics of the current data distribution are of particular interest. Old data may be outdated and can even deteriorate learning if the drift in the data distribution is neglected. 
Consequently, CL systems for non-stationary data require adaptation methods, which deliberately forget outdated information.\nIn this work, we develop an approximate Bayesian approach for training Bayesian neural networks (BNN) (Hinton & van Camp, 1993; Graves, 2011; Blundell et al., 2015) incrementally with nonstationary streaming data. Similar to variational continual learning (VCL) (Nguyen et al., 2018) and the Virtual Vector Machine (VVM) (Minka et al., 2009), we approximate the posterior using a Gaussian distribution and a complementary memory of previous data. Both components are updated sequentially, while adapting to changes in the data distribution. Our main contributions are as follows:\n\u2022 We propose an online approximation consisting of a diagonal Gaussian distribution and a running memory, and we provide a novel sequential update method for both components. \u2022 We extend the online approximation by two alternative adaptation methods, thereby general-\nising online variational Bayes with Bayesian neural networks to non-stationary data. We compare our sequential update method to VCL in the online-inference setting on several popular datasets, demonstrating that our approach is favorable. Furthermore, we validate our adaptation methods on several datasets with concept drift (Widmer & Kubat, 1996), showing performance improvements compared to online variational Bayes without adaptation. \u2217Correspondence to richard.kurle@tum.de 2 BACKGROUND: ONLINE INFERENCE\nConsider a stream of datasets {Dtk}Kk=1, where tk are the time points at which datasets Dtk are observed. For the moment, we assume that these datasets and the samples within are generated independently and identically distributed (i.i.d.). Methods for non-i.i.d. data are considered in Sec. 
4.\nIn the Bayesian approach to online learning, we want to infer the posterior distribution p(w|Dt1:tk) of our model parameters, with the restriction that the data is processed sequentially.1 Using Bayes rule, a recursive posterior inference equation emerges naturally:\np(w|Dt1:tk) \u221d p(w|Dt1:tk\u22121) p(Dtk |w,Dt1:tk\u22121) = p(w|Dt1:tk\u22121) p(Dtk |w), (1) where the last step follows from the i.i.d. assumption of the data.\nIn this paper, we consider Gaussian and multinomial likelihoods, parametrised by a neural network with weights w and prior p(w|\u2205) = p0(w) = N (w;\u00b50, \u03c30). Furthermore, we consider supervised learning, where Dtk = {d (n) tk }n = {(x(n)tk ,y (n) tk )}n and p(d(n)tk |w) = p(y (n) tk |NN ( x (n) tk ;w) ) . 2.1 ONLINE VARIATIONAL BAYES\nSince exact Bayesian inference is intractable for non-trivial models, various approximations have been developed. Prominent examples include sequential Monte Carlo (Liu & Chen, 1998), assumed density filtering (Maybeck, 1982), and online variational Bayes (Opper, 1998; Ghahramani, 2000; Sato, 2001; Broderick et al., 2013). Online variational Bayes (VB) approximates the posterior of Eq. (1) by a parametrised distribution q\u03b8tk (w) \u2248 p(w|Dt1:tk) through a sequence of projections:\nq\u03b8tk (w) = argmin q\u03b8\nKL [ q\u03b8(w) ||Z\u22121tk q\u03b8tk\u22121 (w) p(Dtk |w) ] , (2)\nwhere Ztk is the normalisation constant. The above minimisation is equivalent to maximising the evidence lower bound (ELBO) Ltk(\u03b8;Dtk) = Eq\u03b8(w) [ log p(Dtk |w) ] \u2212KL [ q\u03b8(w) || q\u03b8tk\u22121 (w) ] . In this work, we consider diagonal Gaussian posterior approximations q\u03b8tk (w) for the neural network weights, similar to Nguyen et al. (2018). 2.2 ONLINE VARIATIONAL BAYES WITH MEMORY\nOnline approximate Bayesian inference methods inevitably suffer from an information loss due to the posterior approximation at each time-step. 
An alternative approach to online learning is to store and update a representative dataset/generative model\u2014and to use it as a memory\u2014in order to improve inference (Robins, 1995; Lopez-Paz & Ranzato, 2017; Shin et al., 2017; Kamra et al., 2017). Memory-based online learning has also been combined with online Bayesian inference methods (Minka et al., 2009; Nguyen et al., 2018). A common property of these approaches is to represent the (current) posterior approximation by a product of two factors\np(w|Dt1:tk) \u2248 q\u03b8tk (w) p(Mtk |w) (3) and update them sequentially as new dataDtk is observed. The factor p(Mtk |w) = \u220fM m p(m (m) tk |w) is the likelihood of a set of M = |M| data points, which we refer to as running memory; and q\u03b8tk (w) is a Gaussian distribution, which summarises the rest of the data D\u03041:tk = D1:tk\\Mtk . In case of VCL, the factors in Eq (3) are updated in two steps, which we refer to as (i) memory update and (ii) Gaussian update: (i) a new memoryMtk \u2282 Dtk \u222a Mtk\u22121 is selected using heuristics such as random selection or the k-center method (a greedy algorithm that selects K data points based on geometric properties of Dtk \u222a Mtk\u22121 .); (ii) the Gaussian distribution is updated with the remaining data D\u0304tk = Dtk \u222a Mtk\u22121 \\ Mtk (using Eq. (2)) to obtain q\u03b8tk (w) \u2248 q\u03b8tk\u22121 (w) p(D\u0304tk |w).\nNote that we cannot sample directly from the posterior approximation in Eq. (3) and thus we cannot easily evaluate quantities such as the posterior predictive distribution. VCL therefore performs a second projection\nq\u0303\u03b8tk (w) = argmin q\u03b8\nKL [ q\u03b8(w) || Z\u0303\u22121tk q\u03b8tk (w) p(Mtk |w) ] . (4)\nThis distribution should not be confused with the recursively updated variational distribution (Eq. (2)). 1 A strict definition of online learning requires single data samples at each time step instead of batches Dtk . 
3 IMPROVING MEMORY-BASED ONLINE VARIATIONAL BAYES\nIn this section, we focus on two problems of existing approaches using online VB with a running memory: (i) the memory update does not take into account the approximation error or approximation capabilities of the variational distribution; (ii) the Gaussian update\u2014performed by optimising the ELBO (Eq. (2)) only with data D\u0304tk\u2014can fail for streaming data. This is because VB yields poor posterior approximations if the dataset is too small or the neural network architecture has too much capacity (cf. Ghosh et al. (2018), Fig. 1). In Secs. 3.2 and 3.3, we propose improvements to these two update methods. The mathematical background for our approach is provided in Sec. 3.1. 3.1 PROPERTIES OF THE GAUSSIAN VARIATIONAL APPROXIMATION\nThere are two important properties of the Gaussian variational approximation that we will exploit later: (i) Gaussian approximate posterior distributions factorise into a product of Gaussian terms corresponding to the prior and each likelihood term; (ii) the ELBO can be written as the sum of the approximation\u2019s normalisation constant and a sum of residuals corresponding to these factors.\nLet p0(w) = N (w;\u00b50,\u03a30) be a Gaussian prior and p(D|w) = \u220f n p(d\n(n)|w) be the likelihood of the observed data D. Furthermore, let q\u03b8(w) = N (w;\u00b5,\u03a3) denote the corresponding Gaussian variational approximation with \u03b8 = {\u00b5,\u03a3}. Assume that \u00b5 and \u03a3 are the optimal parameters corresponding to a (local) maximum of the ELBO L(\u00b5,\u03a3;D). The optimality conditions \u2202\u00b5L(\u00b5,\u03a3;D) = 0 and \u2202\u03a3L(\u00b5,\u03a3;D) = 0 can be rewritten as follows (Knowles & Minka, 2011; Opper & Archambeau, 2008; Cseke et al., 2013) (cf. App. 
C):\n\u03a3\u22121\u00b5 = \u03a3\u221210 \u00b50 + \u2211 n ( \u2202\u00b5Eq\u03b8(w) [ log p(d(n)|w) ] \u2212 2\u2202\u03a3Eq\u03b8(w) [ log p(d(n)|w) ] \u00b5 ) , (5a)\n\u03a3\u22121 = \u03a3\u221210 \u2212 2 \u2211 n \u2202\u03a3Eq\u03b8(w) [ log p(d(n)|w) ] . (5b)\nSince the sum of natural parameters corresponds to a product in distribution space, the above equations show that\u2014at a local optimum\u2014the approximation q\u03b8(w) factorises in the same way as the posterior p(w|D). It can be written in the form q\u03b8(w) = Zq\u22121p0(w) \u220f n r\n(n)(w), where the factors r(n)(w) are Gaussian functions with natural parameters given by Eqs. (5a) and (5b), and where Zq = \u222b p0(w) \u220f n r\n(n)(w) dw is the normalisation constant. These Gaussian functions r(n)(w) each correspond to the contribution of the likelihood p(d(n)|w) to the posterior approximation q\u03b8(w). The resulting factorisation implies that the ELBO L(\u00b5,\u03a3;D) can be written in the form (Opper & Winther, 2005) (c.f. App. D)\nL(\u00b5,\u03a3;D) = logZq + \u2211 n Eq\u03b8(w) [ log p(d(n)|w)\u2212 log r(n)(w) ] . (6)\nIf the terms p(d(n)|w) were (diagonal) Gaussian in w, they would each cancel with the corresponding (diagonal) Gaussian term, leaving only logZq. Intuitively, the residual terms in Eq. (6) can be used to quantify the quality of the Gaussian approximation. 3.2 MEMORY UPDATE\nThe authors of VCL propose to use a memory to compensate the information loss resulting from the Gaussian approximation of the posterior distribution. However, their memory update is independent of the approximation error that is due to the chosen distributional family (diagonal Gaussian). An alternative memory update, which specifically targets the above mentioned information loss, has been introduced previously for VVM. 
Although the latter method was developed for expectation propagation in a (linear) logistic regression model\u2014and is thus not directly applicable to online VB\u2014 we show that some of its properties can be transferred to the variational inference setting. The central idea is to replace the likelihood terms that can be well approximated by a Gaussian distribution by their Gaussian proxies p(dtk |w) \u2248 rtk(w;dtk) resulting in q\u03b8tk (w); and retain the data corresponding to the rest of the likelihood terms in the memory. To score a candidate memory, Minka et al. (2009) proposed to maximise the KL divergence between the model given in the form of Eq. (3)\nand a Gaussian posterior approximation, that is, maximise KL [ Z\u0303\u22121tk q\u03b8tk (w) p(M|w) || q\u0303\u03b8tk (w) ] . However, this score function is intractable, because the expectation in the KL includes the likelihood p(M|w). In the following, we develop a tractable score function applicable to VB. Intuitively, we can use Eq. (6) to test how much L(\u00b5,\u03a3;D) changes if we replace the exact likelihood terms (of all data which is not contained in the candidate memory) by their Gaussian approximations.\nTo achieve this, we need to find Gaussian approximations for every data point in the candidate memory. We first approximate the posterior distribution using both Dtk andMtk\u22121 :\nq\u0303\u03b8tk (w) = argmin q\u03b8\nKL [ q\u03b8(w) || Z\u0303\u22121tk q\u03b8tk\u22121 (w) p(Dtk |w) p(Mtk\u22121 |w) ] . (7)\nNext, we use Eqs. (5a) and (5b) to calculate the natural parameters of all Gaussian terms. In practice, we estimate the natural parameters using (unbiased) Monte-Carlo estimators for the expectations. We have now available the likelihood terms and their Gaussian approximations. This allows us to write L(\u03b8tk ;Dtk \u222aMtk\u22121) in the form of Eq. 
(6):\nL(\u03b8tk ; Dtk \u222a Mtk\u22121) = logZqtk + \u2211\ndtk \u2208Dtk\u222aMtk\u22121\nEq\u0303\u03b8tk (w) [ log p(dtk |w)\u2212 log rtk(w;dtk) ] ,\nwhere dtk are the samples in Dtk \u222aMtk\u22121 and where rtk(w;dtk) are the Gaussian approximation of the corresponding likelihood terms. Note that rtk does not only depend on dtk , however, we omit the dependence on the remaining data for notational convenience.\nIf the likelihood p(dtk |w) is close to the Gaussian rtk(w;dtk) in expectation w.r.t. the approximate posterior q\u03b8tk (w), then its contribution to L(\u03b8tk ;Dtk \u222aMtk\u22121) is small. Similarly, likelihood terms that cannot be well approximated by the respective Gaussian have a large contribution, and, hence, the corresponding data should be kept in the memory. For this reason, we propose the score function\nStk(M; Dtk \u222a Mtk\u22121) = \u2211\ndtk \u2208M\nEq\u0303\u03b8tk (w) [ log p(dtk |w)\u2212 log rtk(w;dtk) ] , (8)\nand the corresponding memory updateMtk = argmaxM Stk(M; Dtk \u222a Mtk\u22121). Note that since all residual terms are computed independently, the update results in selecting the top M terms. 3.3 GAUSSIAN UPDATE\nThe Gaussian update follows from the memory update presented in the previous section: once the memoryMtk has been selected, we update the Gaussian distribution with the approximations corresponding to the rest of the data Dtk \u222aMtk\u22121 \\Mtk . We can update q\u03b8tk (w) in two equivalent ways:\nq\u03b8tk (w) = q\u03b8tk\u22121 (w) \u220f\ndtk 6\u2208Mtk\nrtk (w;dtk ), (9a) q\u03b8tk (w) = q\u0303\u03b8tk (w) / \u220f\ndtk \u2208Mtk\nrtk (w;dtk ). (9b)\nNote again that the natural parameters of rtk(w;dtk) are estimated using Monte Carlo and the products in the above equations imply a summation of the natural parameters. In order to reduce the variance of this sum of estimators, we use Eq. (9a) if |Dtk | \u2264 |Mtk |, and Eq. (9b) if |Dtk | > |Mtk |. 
Furthermore, we can compute the average bias from all natural parameter estimates (see App. C). We reduce the bias of our estimates by subtracting the average bias from all estimates. Note that a further option to update qtk(w) would be to use VB on the data Dtk \u222a Mtk\u22121 \\Mtk to compute the update q\u03b8tk (w) \u2248 q\u03b8tk\u22121 (w) p(Dtk \u222a Mtk\u22121 \\Mtk |w). The latter approach is numerically more stable but computationally more expensive. It also turned out to be less favorable than the update using Eq. (9a) or Eq. (9b) in case of small datasets Dtk , because VB applied to BNNs with small datasets often leads to a poor fit.\nPrevious work hypothesised that this problem is an artifact of the ELBO and not an optimisation problem (Trippe & Turner, 2018; Turner & Sahani, 2011). We provide further evidence in Fig. 1, where we infer the posterior of a Bayesian neural network with VB, using 70 and 100 data samples respectively, and compare it to posterior inference with MCMC. In case of VB with 70 samples, the posterior approximation yields a model that is almost linear. These difficulties of posterior inference with variational Bayes are especially problematic in case of the streaming data setting, where the number of observations at each time-step is typically very small. The Gaussian update proposed above can alleviate the problem of having to train BNNs with small datasets. Specifically, we have Ntk +M instead of Ntk data points to find a better optimum of the ELBO.\n4 VARIATIONAL BAYES WITH MODEL ADAPTATION\nThe incremental learning methods discussed so far assume i.i.d. data (cf. Sec. 2, Eq. (1)). This assumption can be reasonable even in scenarios with changing data distributions, e.g. when the data drift is an algorithmic artifact rather than a real phenomenon. For example, in online multi-task or curriculum learning we want to learn a model of all tasks, but we may choose to learn the tasks incrementally for various reasons (e.g. 
Nguyen et al., 2018; Kirkpatrick et al., 2017; Schwarz et al., 2018; Rusu et al., 2016; Fernando et al., 2017). However, such approaches are not applicable for modeling non-stationary data: one of the properties of online VB is that the variance of the Gaussian posterior approximation shrinks as O(1/N), where N is the total amount of data (e.g. Opper, 1998). Consequently, learning comes to a halt as t\u2192\u221e. To overcome this issue, the model needs to be extended by a method that enables it to adapt to changes in the data distribution, e.g., by deliberately forgetting the belief inferred from previous data.\nIn the following, we describe two alternative methods for adapting to changing data. In Sec. 4.1, we impose Bayesian exponential forgetting, which forgets previous data exponentially by weighting the likelihood terms (or their approximations). In Sec. 4.2, we implement the adaptation through a diffusion process applied to the neural network parameters. Compared to the online learning scenario, we make the following assumptions: (i) we observe datasets $D_{t_k}$ at potentially non-equidistant time steps $t_k$; (ii) data within $D_{t_k}$ is assumed to be i.i.d., but data from different datasets $D_{t_k}$ and $D_{t_{k+1}}$ is not. In both approaches, we realise adaptation by an additional forgetting step before observing the new data $D_{t_{k+1}}$. We denote by $p_{t_{k+1}}(w)$ the distribution that results from applying the forgetting step to the posterior approximation $q_{\\theta_{t_k}}(w) \\, p(\\mathcal{M}_{t_k}|w)$.\n4.1 ADAPTATION WITH BAYESIAN FORGETTING\nModel adaptation through forgetting can be achieved by decaying the likelihood based on the temporal recency of the data (Graepel et al., 2010; Honkela & Valpola, 2003). It has been explored previously as an alternative to filtering and is referred to as Bayesian exponential forgetting (Kulhav\u00fd & Zarrop, 1993). This approach defines a forgetting operator that yields $p(w_{t_{k+1}}|D_{t_1:t_k})$ directly. 
Here, we use a continuous-time version of this forgetting operation that can be formulated as\n$p(w|D_{t_1:t_K}) \\propto p_0(w) \\prod_{k=1}^{K} p(D_{t_k}|w)^{(1-\\epsilon)^{(t_K - t_k)/\\tau}}$, (10)\nwhere $\\tau$ is a time-constant corresponding to the average of the time-lags $\\Delta t_{k+1} = t_{k+1} - t_k$. The distribution defined in Eq. (10) can be formulated recursively (cf. App. F) as\n$p(w|D_{t_1:t_{k+1}}) \\propto p_0(w)^{1-(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}} \\, p(w|D_{t_1:t_k})^{(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}} \\, p(D_{t_{k+1}}|w)$. (11)\nThis equation can be viewed as Bayes rule (Eq. (1)) applied after the forgetting step. The first two terms of Eq. (11) can be identified as the forgetting operation, applied to the current posterior. In order to apply this operation to our posterior approximation $q_{\\theta_{t_k}}(w) \\, p(\\mathcal{M}_{t_k}|w)$, we modify it by an additional weighting factor for each likelihood term in the memory. Denoting the age of a memory item $m$ by $\\Delta t_k(m)$, the forgetting operation for this new posterior approximation then results in\n$p_{t_{k+1}}(w) \\propto p_0(w)^{1-(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}} \\times [ q_{\\theta_{t_k}}(w) \\prod_{m\\in\\mathcal{M}_{t_k}} p(m|w)^{(1-\\epsilon)^{\\Delta t_k(m)/\\tau}} ]^{(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}}$\n$= [ p_0(w)^{1-(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}} \\, q_{\\theta_{t_k}}(w)^{(1-\\epsilon)^{\\Delta t_{k+1}/\\tau}} ] \\times \\prod_{m\\in\\mathcal{M}_{t_k}} p(m|w)^{(1-\\epsilon)^{\\Delta t_{k+1}(m)/\\tau}}$, (12)\nwhere $\\Delta t_{k+1}(m) = \\Delta t_k(m) + \\Delta t_{k+1}$. As can be seen from Eq. (12), BF acts on both factors of the posterior approximation independently: in case of the memory, it re-weights the respective likelihood terms by updating $\\Delta t_{k+1}(m)$. For the Gaussian term $q_{\\theta_{t_k}}(w)$, BF leads to a weighted product with the prior distribution (i.e. the first two terms of Eq. 
(12)), resulting in a Gaussian with parameters\n$\\sigma_{t_{k+1}}^{-2} = ( 1-(1-\\epsilon)^{\\Delta t_{k+1}/\\tau} ) \\sigma_0^{-2} + (1-\\epsilon)^{\\Delta t_{k+1}/\\tau} \\sigma_{t_k}^{-2}$,\n$\\sigma_{t_{k+1}}^{-2} \\mu_{t_{k+1}} = ( 1-(1-\\epsilon)^{\\Delta t_{k+1}/\\tau} ) \\sigma_0^{-2} \\mu_0 + (1-\\epsilon)^{\\Delta t_{k+1}/\\tau} \\sigma_{t_k}^{-2} \\mu_{t_k}$.\nFor $\\Delta t_{k+1} \\to \\infty$, the likelihood term in Eq. (12) converges to the uniform distribution and the Gaussian term reverts to the prior. We note, however, that while Eq. (11) is an exact recursive form of Eq. (10), the online VB approximation of Eq. (11) is not generally identical to the (offline) VB approximation of Eq. (10) due to its successive approximations. For tuning the hyperparameter $\\epsilon$, we note that the weighting of likelihood terms corresponds to an effective dataset size of $N/\\epsilon$ (if all datasets are of equal size $N$). In Fig. 2, we also visualise the forgetting operation applied to the Gaussian part of the posterior approximation for multiple values of $\\epsilon$.\n4.2 ADAPTATION WITH DIFFUSION PROCESSES\nModel adaptation can also be realised by using dynamic model parameters that evolve according to a stochastic process. In this case, adaptation is achieved by the stochastic transition $p_{t_{k+1},t_k}(w'|w)$, resulting in a prediction distribution\n$p_{t_{k+1}}(w') = \\int p_{t_{k+1},t_k}(w'|w) \\, p(w|D_{t_1:t_k}) \\, dw$, (13)\nwhere we consider Gaussian transitions $p_{t_{k+1},t_k}(w'|w)$. However, this operation is generally not tractable for our posterior approximation $q_{\\theta_{t_k}}(w) \\, p(\\mathcal{M}_{t_k}|w)$. Moreover, the forgetting operation implied by the transition does not retain the product form as in the case of BF. For this reason, we consider only a Gaussian posterior approximation (without memory) for this approach, that is $p_{t_{k+1}}(w') = \\int p_{t_{k+1},t_k}(w'|w) \\, q_{\\theta_{t_k}}(w) \\, dw$.\nAs mentioned in Sec. 4.1, BF yields the prior distribution for $\\Delta t_{k+1} \\to \\infty$. 
This is a desirable property, since it corresponds to forgetting all information conveyed by the data. In case of a Gaussian prior, the only Gaussian process that fulfills this requirement is the Ornstein-Uhlenbeck (OU) process given by the stochastic differential equation $dw_t = \\theta \\cdot (\\mu_0 - w_t) \\, dt + \\sigma_0 \\sqrt{2\\theta} \\, dW_t$, where $\\theta$ is the stiffness parameter which controls the drift rate towards $\\mu_0$. To decouple the adaptation parameter from the rate at which data is observed, we rescale the stiffness parameter as $\\theta = a/\\tau$. The resulting prediction distribution $p_{t_{k+1}}(w) = \\mathcal{N}(\\mu_{t_{k+1}}, \\sigma_{t_{k+1}}^2)$ is defined by the parameters\n$\\mu_{t_{k+1}} = \\big( 1 - e^{-a \\Delta t_{k+1}/\\tau} \\big) \\, \\mu_0 + e^{-a \\Delta t_{k+1}/\\tau} \\, \\mu_{t_k}$,\n$\\sigma_{t_{k+1}}^2 = \\big( 1 - e^{-2a \\Delta t_{k+1}/\\tau} \\big) \\, \\sigma_0^2 + e^{-2a \\Delta t_{k+1}/\\tau} \\, \\sigma_{t_k}^2$.\nAn interesting observation is that both parameters evolve independently of each other. In contrast to BF, the mean and variance, rather than the natural parameters, follow an exponential decay. The hyperparameter $a$ can be determined e.g. through the half-time of the exponential decay of the mean parameter, given as $\\tau_{1/2} = \\ln 2 / \\theta$. We visualise the time evolution of the above parameters in Fig. 2.\n6 EXPERIMENTS\nWe validate our proposed inference methods in two stages. In Sec. 6.1, we compare our memory update and Gaussian update (Sec. 3) to existing memory-based online inference methods on several standard machine learning datasets. In Sec. 6.2, we evaluate our adaptation methods (Sec. 4) on commonly used datasets with concept drift (Widmer & Kubat, 1996), where the conditional distribution of labels given the features changes over time (i.e. 
non-stationary data in the context of predictive models).\nWe found that training (variational) Bayesian neural networks on streaming data is challenging. Specifically, our approach requires model parameters very close to a local optimum, since Eqs. (5a) and (5b) hold only at local extrema of the ELBO. To overcome these difficulties, we use several methods to reduce the variance of the gradient estimates for learning: (i) we apply the local reparametrisation trick (Kingma et al., 2015); (ii) we use the Adam optimiser (Kingma & Ba, 2014); and (iii) we use multiple Monte Carlo samples to estimate the gradients (cf. Tab. 2 for details). Furthermore, we developed methods for determining hyperparameters of the Gaussian prior and the initialisation distribution of Bayesian neural networks. The idea is similar to the initialisation method proposed by Glorot & Bengio (2010) and He et al. (2015): we choose the prior and the posterior initialisation such that the mean and standard deviation of the activations in every layer are approximately zero and one, respectively. We refer to App. H and App. I for a derivation and further details.\nWe use the following metrics for evaluation: (i) the avg. test log-marginal likelihood (LML) $N_{\\text{test}}^{-1} \\sum_n \\log \\mathbb{E}_{\\tilde{q}_{\\theta_{t_k}}(w)} \\big[ p(d_{\\text{test}}^{(n)}|w) \\big]$, where $d_{\\text{test}}^{(n)}$ is a sample from a held-out test dataset; (ii) the avg. one-step-ahead LML $N_{t_{k+1}}^{-1} \\sum_n \\log \\mathbb{E}_{\\tilde{q}_{\\theta_{t_k}}(w)} \\big[ p(d_{t_{k+1}}^{(n)}|w) \\big]$, where $d_{t_{k+1}}^{(n)}$ is data observed at time-step $t_{k+1}$. Both metrics measure predictive performance; however, (i) can be used in the online setting, where the data is i.i.d., and (ii) is typically used to evaluate models with non-stationary streaming data.\n6.1 ONLINE LEARNING\nIn this section, we evaluate our running memory (Sec. 3) in an online learning setting. 
To illustrate how our memory update works, we start our evaluation with a qualitative assessment: we train a model on 2-dimensional toy data (two-moons), where we can visualise the selected memory. The BNN has 2 layers with 16 units and tanh activations, and has a prior p0(w) = N (w; 0, 1) on all\nweights and biases. The memory-size is M = 30. The model observes 150 data samples at time-step t0 and 15 samples at all consecutive time-steps. In Fig. 3, we visualise the selected memory and the corresponding scores for time-steps t0, t25, and t50, respectively. We can make the empirical observation that our method favors data close to the decision boundary. Furthermore, in Fig. 3d, we visualise the one-step-ahead LML for data that will be selected and evicted (in the next time-step), respectively. This shows that our memory update tends to select data for which the model has a low predictive LML. These observations support our intuition that the memory is indeed complementary to the Gaussian approximation, selecting data for which the likelihood cannot be well approximated by a Gaussian function. In Fig. 9 of the supplementary material, we visualised the running memory for a model trained on MNIST, showing that the memory also accumulates diverse samples over time.\nWe evaluate our memory-based online inference method (Sec. 3) quantitatively on several standard machine learning datasets, including regression (UCI Boston, UCI Concrete, UCI Energy, UCI Yacht) and classification (MNIST, UCI Spam, UCI Wine) tasks. Here, we refer to our approach as Gaussian Residual Scoring (GRS). We compare GRS to the respective memory update and Gaussian update methods proposed in VCL (Nguyen et al., 2018) (cf. Sec. 2.2). Refer to App. B for an explanatory list of compared update methods. Online learning is performed by observing Ntk samples per time-step (cf. Tab. 2 for the experiment setup and hyperparameters.). For evaluation, we use a random held-out test dataset (20% of the data). 
We perform each experiment with 16 different random data splits and random seeds for the model parameter initialisation. In Fig. 4, we plot the test LML, averaged over the 16 runs, against the memory size, and the LML over all time-steps. In most cases, random selection and the k-center method start with a worse initial fit at t0. This is because these methods perform the initial Gaussian update by optimising the ELBO with Nt0 \u2212M samples at t0; by contrast, GRS uses a Gaussian update that first optimises the ELBO with Nt0 samples and subsequently discounts the contribution of the memory. In Tab. 1, we report the mean and std. deviation of the LML, where the mean and std. deviation are taken over the 16 independent runs, each averaged over the last 10% time-steps. The results demonstrate the superior predictive performance of our update methods. We also note that the experiments on the smaller datasets (cf. Tab. 2 in App. B) result in a high variance among the random data splits and random seeds. This is the case for all compared methods and it could not be remedied e.g. by using annealing or a different prior. 6.2 ADAPTATION\nIn this section, we evaluate our adaptation methods (Sec. 4) in settings with concept drift. We begin with a simple logistic regression problem, where the data Dtk = {(xtk ,ytk)}n, xtk \u2208 R2, ytk \u2208 R is sampled from xtk \u223c Uniform(\u22123, 3), ytk \u223c Bernoulli(\u03c3(wtkxtk)). The true model has two time-dependent parameters w0tk = 10 sin(\u03b1 \u00b7 tk), w 1 tk\n= 10 cos(\u03b1 \u00b7 tk), where \u03b1 = 5 deg /sec and where we observe data at tk \u2208 [0, 1, . . . , 720]. Fig. 5 shows the learned model parameters for standard online learning (without adaptation), OU process transitions, and Bayesian forgetting. If the time-dependence of the data is ignored (in case of online VB), the class labels are distributed with equal probability in the whole input space. 
Consequently, as $t \\to \\infty$, the weights of the model without adaptation shrink to 0. By contrast, the posterior means of BF and the OU process follow a sinusoidal curve, as do the parameters of the true model.\nWe also evaluate our adaptation methods quantitatively on 3 datasets with concept drift (Weather, Gas Sensor Array Drift, Covertype). We compare online VB (without adaptation), the Wiener process (a special case of the OU process), the OU process, and Bayesian Forgetting (with and without memory). All compared variants use the same model architecture and hyperparameters (cf. Tab. 2 in the supplementary material). We report the one-step-ahead LML, where the expectation is approximated with 500 Monte Carlo samples. Results are averaged over the last 50% of time-steps, because we are interested in the continual learning performance, and the first few time-steps will be similar for most methods. We report the mean and std. deviation over 8 independent runs with different random seeds. In Fig. 6 (and Fig. 10 in the appendix), we plot the LML against 10 adaptation parameter values (of the respective adaptation method), where the value zero corresponds to online VB. The LML for BF with different memory sizes and a fixed forgetting rate $\\epsilon = 0.11$ is shown in Fig. 6. As can be seen from the results, all adaptation methods significantly improve the performance compared to online VB. Interestingly, the Ornstein-Uhlenbeck process performs better than Bayesian Forgetting; however, using a running memory with Bayesian Forgetting closes the gap.\n7 CONCLUSION\nIn this work, we have addressed online inference for non-stationary streaming data using Bayesian neural networks. We have focused on posterior approximations consisting of a Gaussian distribution and a complementary running memory, and we have used variational Bayes to sequentially update the posteriors at each time-step. 
Existing methods update these two components without any interaction between them, and they lack methods to adapt to non-stationary data. We have proposed a novel update method, which treats both components as complementary, and two novel adaptation methods (in the context of Bayesian neural networks with non-stationary data), which gradually revert to the prior distribution if no new data is observed.\nFuture research could extend our work with drift detection methods and use them to infer the adaptation parameters. This work could also be extended by developing adaptation methods for gradual, abrupt, or recurring changes in the data distribution. Finally, we observed that variational Bayesian neural networks with a uni-modal approximate posterior often find poor local minima if the dataset is small and models are complex. This is especially challenging in scenarios with streaming data. While our Gaussian update alleviates this problem to a certain degree, further research in extending the approximation family beyond Gaussians could be beneficial. Progress in this direction would improve our proposed methods and allow them to scale to more complex models.\n8 APPENDIX\nA FURTHER EXPERIMENTAL RESULTS\nA.1 MEMORY\nHere we provide additional experimental results for the memory update and Gaussian update from Sec. 3. We conducted experiments on 3 additional datasets (UCI Boston, UCI Yacht, UCI Red Wine). The influence of the memory size and the performance over time (for a specific memory size) are shown in Fig. 7 (corresponding to Fig. 4 in the main text).\nFurthermore, we test the memory update and Gaussian update of GRS separately on UCI Energy and UCI Concrete. For this purpose, i) we combine the k-center method with our Gaussian update from Sec. 3.3; and ii) we use our memory update from Sec. 3.2 and update the Gaussian distribution by optimising Eq. (2) with $(\\mathcal{D}_{t_k} \\cup \\mathcal{M}_{t_{k-1}}) \\setminus \\mathcal{M}_{t_k}$ (re-fitting). The results are shown in Fig. 8. 
As can be seen, the full GRS performs better than either of its components combined with a baseline method. GRS with refit performs especially badly, similar to the baselines k-center and random. This is because refitting requires optimising the ELBO with a small dataset. As mentioned in Sec. 3.3 (cf. Fig. 1), Bayesian neural networks with VB perform poorly on small datasets due to over-regularisation. Consequently, in case of refitting, a good memory update can lead to a worse overall performance due to a much worse Gaussian update. While this general issue with Bayesian neural networks (learned with VB) is beyond the scope of this work, it is an important future research direction.\nTo better understand our memory update using the score function from Eq. (8), we visualise the running memory for a model trained on MNIST in Fig. 9.\nWe also evaluated our adaptation methods on 2 additional datasets (Gas Sensor Array Drift, Weather). In Fig. 10, we visualise the influence of the adaptation parameter for these datasets. Note that the adaptation parameters span a much smaller range than in the experiments on Covertype (Sec. 6.2). For larger values, the performance starts to degrade. Surprisingly, the memory degrades the performance in case of the Gas Sensor Array Drift dataset.\nA.3 CATASTROPHIC FORGETTING WITH ONLINE VB AND BAYESIAN NEURAL NETWORKS\nHere we provide further experimental results for the behavior of online VB (Secs. 2.1, 3) in case of non-stationary data. For this purpose, we train Bayesian neural networks with different architectures on the toy classification problem with a rotating decision boundary from Sec. 6.2, but with 150 data samples per time-step. In Fig. 11, we visualise the training LML for different architectures, including a linear model. It can be seen that Bayesian neural networks with higher complexity (i.e. more layers or more units) drop in performance more slowly than the linear model. 
However, this is not a desired property for online VB, since exact online Bayesian inference would yield the same posterior distribution as offline Bayesian inference. In case of our toy classification data (where the time dependence is ignored), online inference should not be able to classify the data as t\u2192\u221e. Instead, this learning behavior shows that online VB with Gaussian approximate posterior distributions is prone to catastrophic forgetting. B EXPERIMENT SETUP\nThe following is an explanatory list of the update methods used in Sec. 6.1:\n\u2022 k-center (VCL): Uses the k-center method (Sec.2.2) for the memory update and Eq. (2) with (Dtk \u222a Mtk\u22121) \\Mtk for the Gaussian update.\n\u2022 random (VCL): Uses random selection (Sec.2.2) for the memory update and Eq. (2) with (Dtk \u222a Mtk\u22121) \\Mtk for the Gaussian update.\n\u2022 GRS (Gaussian Residual Scoring, ours): Uses Eq. (8) for the memory update (Sec. 3.2) and performs the Gaussian update by first using Eqs. (2) with (Dtk \u222a Mtk\u22121) and subsequently using Eqs. (5a), (5b) for removing the local contributions ofMtk (cf. Sec. 3.3).\nSimilarly, the following list summarises the adaptation methods used in Sec. 6.2:\n\u2022 Wiener process: Posterior approximation consists of q\u03b8tk (w) only. Transition p(wtk+1 |wtk) is given by a random walk. We used a diffusion that is proportional to the prior standard deviation in every neural network layer (cf. Sec. 4.2). No memory used.\n\u2022 Ornstein-Uhlenbeck process: Posterior approximation consists of q\u03b8tk (w) only. Transition p(wtk+1 |wtk) is given by the Ornstein-Uhlenbeck process (cf. Sec. 4.2). No memory used.\n\u2022 Bayesian forgetting: Posterior approximation consists of q\u03b8tk (w) only. No state-space model assumption, instead uses Bayesian exponential forgetting (cf. Sec. 4.1).\n\u2022 Bayesian forgetting with memory: Posterior approximation consists of q\u03b8tk (w) andMt. 
No state-space model assumption, instead uses Bayesian exponential forgetting (cf. Sec. 4.1).\nIn Tab. 2, we summarise the experimental setup (hyperparameters) used for Secs. 6.1 and 6.2.\nC FACTORISATION PROPERTY OF THE GAUSSIAN VARIATIONAL APPROXIMATION\nHere we derive the factorisation property of the Gaussian variational approximation distribution by expressing the natural parameters of the Gaussian approximation as a sum. This can be shown for the Gaussian approximation at a local optimum of the ELBO. For a Gaussian prior and posterior the ELBO is given as\n$\\mathcal{L}(\\mu^*, \\Sigma^*) = -\\frac{1}{2} \\Big( \\log|\\Sigma_0| - \\log|\\Sigma^*| - d + (\\mu^* - \\mu_0)^T \\Sigma_0^{-1} (\\mu^* - \\mu_0) + \\mathrm{Tr}(\\Sigma^* \\Sigma_0^{-1}) \\Big) + \\sum_{n=1}^{N} \\mathbb{E}_{w \\sim q_{\\theta^*}(w)} \\big[ \\log p(d^{(n)}|w) \\big]$.\nAt a local optimum, we have $\\partial \\mathcal{L}(\\mu^*, \\Sigma^*) / \\partial \\mu^* = 0$, which yields\n$\\sum_{n=1}^{N} \\frac{\\partial}{\\partial \\mu^*} \\mathbb{E}_{w \\sim q_{\\theta^*}(w)} \\big[ \\log p(d^{(n)}|w) \\big] = \\Sigma_0^{-1} (\\mu^* - \\mu_0)$.\nHence, we obtain\n$\\mu^* = \\mu_0 + \\Sigma_0 \\sum_{n=1}^{N} \\frac{\\partial}{\\partial \\mu^*} \\mathbb{E}_{w \\sim q_{\\theta^*}(w)} \\big[ \\log p(d^{(n)}|w) \\big]$. (14)\nSimilarly, we have $\\partial \\mathcal{L}(\\mu^*, \\Sigma^*) / \\partial \\Sigma^* = 0$, which yields\n$\\sum_{n=1}^{N} \\frac{\\partial}{\\partial \\Sigma^*} \\mathbb{E}_{w \\sim q_{\\theta^*}(w)} \\big[ \\log p(d^{(n)}|w) \\big] = -\\frac{1}{2} (\\Sigma^*)^{-1} + \\frac{1}{2} \\Sigma_0^{-1}$. (15)\nHence, we obtain\n$\\Sigma^* = \\Big( \\Sigma_0^{-1} - 2 \\sum_{n=1}^{N} \\frac{\\partial}{\\partial \\Sigma^*} \\mathbb{E}_{w \\sim q_{\\theta^*}(w)} \\big[ \\log p(d^{(n)}|w) \\big] \\Big)^{-1}$. (16)\nNext, we calculate the natural parameters from Eqs. 
(14), (16):\n\u039b\u2217 = \u039b0 + N\u2211 n=1 \u22122 \u2202 \u2202\u03a3\u2217 Ew\u223cq\u03b8\u2217 (w) [ log p(d(n)|w) ]\n= \u039b0 + N\u2211 n=1 \u039b(n).\n\u03b7\u2217 = \u039b\u2217\u00b5\u2217 = (\n\u039b0 + N\u2211 n=1 \u039b(n) ) \u00b5\u2217\n= \u039b0 ( \u00b50 + \u03a30 N\u2211 n=1 \u2202 \u2202\u00b5\u2217 Ew\u223cq\u03b8\u2217 (w) [ log p(d(n)|w) ]) + N\u2211 n=1 \u039b(n)\u00b5\u2217\n= \u039b0\u00b50 + N\u2211 n=1 ( \u2202 \u2202\u00b5\u2217 Ew\u223cq\u03b8\u2217 (w) [ log p(d(n)|w) ] + \u039b(n)\u00b5\u2217 )\n= \u03b70 + N\u2211 n=1 \u03b7(n).\nMonte Carlo estimation: The natural parameters \u039b(n), \u03b7(n) can be estimated with Monte Carlo, by replacing the expectation with an empirical mean. Since the parameters \u039b\u2217 and \u039b0 (and \u03b7\u2217, \u03b70 respectively) are known, the total bias of the parameter estimates can be computed:\n\u039bb = (\u039b \u2217 \u2212 \u039b0)\u2212 N\u2211 n=1 \u039b(n), \u03b7b = (\u03b7 \u2217 \u2212 \u03b70)\u2212 N\u2211 n=1 \u03b7(n).\nWe use this to reduce the bias for the individual terms:\n\u039b\u0302(n) = \u039b(n) \u2212 1 N \u039bb, \u03b7\u0302 (n) = \u03b7(n) \u2212 1 N \u03b7b. D ELBO IN RESIDUALS FORM\nHere we show how the ELBO can be written in the form of Eq. (6). Let us define the variational distribution in the factorised form q\u03b8(w) = Z\u22121q p(w) \u220fN n=1 r\n(n)(w) (cf. Sec.3.1). We can then write the ELBO as\nL(\u00b5,\u03a3;D) = Eq\u03b8(w) [ N\u2211 n=1 log p(d(n)|w) + log p(w)\u2212 log q\u03b8(w) ]\n= Eq\u03b8(w) [ N\u2211 n=1 ( log p(d(n)|w) + log r(n)(w)\u2212 log r(n)(w) ) + log p(w)\u2212 log q\u03b8(w) ]\n= Eq\u03b8(w) [ N\u2211 n=1 ( log p(d(n)|w)\u2212 log r(n)(w) ) + log p(w) \u220fN n=1 r (n)(w) q\u03b8(w) ]\n= Eq\u03b8(w) [ N\u2211 n=1 ( log p(d(n)|w)\u2212 log r(n)(w) ) + log Zq \u00b7 q\u03b8(w) q\u03b8(w) ] = logZq +\n\u2211 n Eq\u03b8(w) [ log p(d(n)|w)\u2212 log r(n)(w) ] . E MEMORY UPDATE SCORE FUNCTION\nIn Eq. 
8, the expectation involving Gaussian terms can be calculated analytically: Eq\u0303\u03b8tk (w) [ log r (m) tk (w) ] = \u222b q\u0303\u03b8tk (w) ( \u03b7(n)w \u2212 1 2 \u039b(n)w2 ) dw\n= \u03b7(n)\u00b5(n) \u2212 1 2\n\u039b(n) ( (\u00b5\u2217)2 + \u03a3\u2217 ) = \u03b7(n)(\u039b\u2217)\u22121\u03b7\u2217 \u2212 1\n2 \u039b(n)\n( (\u039b\u2217)\u22121\u03b7\u2217 )2 \u2212 1 2 \u039b(n)(\u039b\u2217)\u22121\nThe expectation involving non-Gaussian terms (in Eq. 8) has no closed-form solution. We therefore estimate Eq\u0303\u03b8tk (w)\n[ log p(d (n) tk |w) ] using Monte-Carlo. F BAYESIAN FORGETTING - RECURSIVE FORMULATION\nHere we show how Bayesian forgetting can be rearranged into a recursive formulation. We first bring this formula into a similar form as Eq. (1), extracting the most recent likelihood term:\np(w|Dt1:tK+1) \u221d p0(w) \u00b7 K+1\u220f k=1 p(Dtk |w)(1\u2212 ) tK+1\u2212tk \u03c4\n= p0(w) \u00b7 K\u220f k=1 p(Dtk |w)(1\u2212 ) tK+1\u2212tk \u03c4 \u00b7 p(DtK+1 |w).\nThe first two terms can be rewritten as\np0(w) \u00b7 K\u220f k=1 p(Dtk |w)(1\u2212 ) tK+1\u2212tk \u03c4 = p0(w) \u00b7 K\u220f k=1 p(Dtk |w)(1\u2212 ) tK+1\u2212tK+tK\u2212tk \u03c4\n= p0(w) \u00b7 K\u220f k=1 p(Dtk |w)(1\u2212 ) tK\u2212tk \u03c4 \u00b7(1\u2212 ) tK+1\u2212tK \u03c4\n= p0(w) \u00b7 ( K\u220f k=1 p(Dtk |w)(1\u2212 ) tK\u2212tk \u03c4 )(1\u2212 ) tK+1\u2212tK\u03c4\n\u221d p0(w) \u00b7 ( p(w|Dt1:tK ) p0(w) )(1\u2212 ) tK+1\u2212tK\u03c4 = p0(w) 1\u2212(1\u2212 ) tK+1\u2212tK \u03c4 \u00b7 p(w|Dt1:tK )(1\u2212 ) tK+1\u2212tK \u03c4\nHence, we have shown that the posterior can be expressed recursively as\np(w|Dt1:tk+1) \u221d p0(w)1\u2212(1\u2212 ) \u2206tk+1/\u03c4 p(w|Dt1:tk)(1\u2212 ) \u2206tk+1/\u03c4\np(Dtk+1 |w). 
The parameters of the Gaussian part $q_{\\theta_{t_k}}(w)$ of the posterior approximation (after applying the forgetting operation) can be calculated easily from the above equation.\nNatural parameters:\n$\\Lambda_{t_{k+1}} = \\Lambda_0 \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\Lambda_{t_k} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau}$\n$\\eta_{t_{k+1}} = \\eta_0 \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\eta_{t_k} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau}$\nCovariance parameter:\n$\\Sigma_{t_{k+1}} = \\Big( \\Lambda_0 \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\Lambda_{t_k} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\Big)^{-1}$\n$= \\Big( \\Sigma_0^{-1} \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\Sigma_{t_k}^{-1} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\Big)^{-1}$\nLocation parameter:\n$\\mu_{t_{k+1}} = \\Sigma_{t_{k+1}} \\cdot \\eta_{t_{k+1}}$\n$= \\Sigma_{t_{k+1}} \\Big( \\eta_0 \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\eta_{t_k} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\Big)$\n$= \\Sigma_{t_{k+1}} \\Big( \\Sigma_0^{-1} \\mu_0 \\cdot \\big( 1 - (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\big) + \\Sigma_{t_k}^{-1} \\mu_{t_k} \\cdot (1-\\epsilon)^{(t_{k+1}-t_k)/\\tau} \\Big)$\nG PSEUDO-ALGORITHM\nWe provide the pseudo algorithm of GRS (Sec. 3) with Bayesian forgetting in Alg. 1. The computational complexity (at each of the $K$ time-steps) is dominated by i) the minimisation of the KL divergence, ii) estimating the Gaussian factors, and iii) scoring the memory. The KL minimisation requires $I_{t_k}$ sequential iterations with $N_{t_k} + M$ data samples and $K_{\\text{train}}$ Monte Carlo samples. The latter can both be processed in parallel on parallel hardware. The estimation of the Gaussian factors requires $N_{t_k} + M$ sequential iterations with $K_{\\text{term}}$ parallel Monte Carlo samples. The dominating computation for calculating the scores is the evaluation of the likelihood for $N_{t_k} + M$ data samples and $K_{\\text{train}}$ Monte Carlo samples, both of which can be processed in parallel. The highest scoring candidate memory is given by the top-$M$ highest scoring data points, thus, the computational complexity of Eq. 
(8) is only linear in the number of samples.\nAlgorithm 1 Gaussian Residual Scoring with Bayesian forgetting. The function EstimateGaussianFactors corresponds to Eqs. (5a), (5b) (cf. also App. C). The function ApplyForgetting corresponds to Eq. (12). Note that ptk(w) includes the adapted likelihood of the memory and all subsequent functions involving the memory use this adapted likelihood.\nInputs: p0(w), q\u03b8t0 (w), \u03c4 , K for k in [0...K] do tk = GetTimeStamp(k) \u2206tk = tk \u2212 tk\u22121 Dtk = GetData(tk) if k == 0 then ptk(w) = p0(w)\nelse ptk(w) = ApplyForgetting ( p0(w), q\u03b8tk\u22121 ,Mtk\u22121 ,\u2206tk ) {Sec. 4.1} end if q\u0303\u03b8tk (w) = argminq\u03b8 KL [ q\u03b8(w) || Z\u0303\u22121tk ptk(w) p(Dtk |w) ] {Sec. 3.2, Sec. 4.1}\n{rtk(w;dtk)}dtk\u2208Dtk\u222aMtk\u22121 = EstimateGaussianFactors ( q\u0303\u03b8tk (w),Dtk ,Mtk\u22121 ) {Sec. 3.1} Mtk = argmaxM Stk(M; Dtk \u222a Mtk\u22121) {Sec. 3.2} if |Dtk | \u2264 |Mtk | then q\u03b8tk (w) = ptk(w) \u220f dtk 6\u2208Mtk\nrtk(w;dtk) {Sec. 3.3} else q\u03b8tk (w) = q\u0303\u03b8tk (w) / \u220f dtk \u2208Mtk\nrtk(w;dtk) {Sec. 3.3} end if\nend for H PRIOR PARAMETERS\nHere we develop a heuristic to choose the initial prior p0(w). As this will not be specific to the online or continual setting, we drop the time index in this section, denoting the prior as p(w). Furthermore, we consider only Gaussian distributions with a diagonal covariance matrix. Assume that the data is standardised, that is, the first two moments are zero and one. A reasonable choice for the prior parameters is such that the first two moments of the prior-predictive distribution equals the first two moments of the data distribution. We go one step further and constrain the pre-activations of every neural network layer to have moments zero and one. Denote all weight matrices and weight biases by w = {Wl}l \u222a {bl}l, and let x0 denote the input data. 
Let us further denote the pre-activation (before non-linearity) of layer l and unit i as follows.\nxil = \u2211 j Wi,jl fl\u22121(x j l\u22121) + b i l.\nThe constraints are then given as follows. Ew\u223cp(w) [ Ex0\u223cp(D) [ xil ]] = 0, Ew\u223cp(w) [ Ex0\u223cp(D) [ (xil) 2 ]] = 1.\nThe first constraint can be easily fulfilled by setting the prior mean to zero for all parameters.\n\u00b5i,jl = 0.\nThis follows immediately from Wl \u22a5 fl\u22121(xl\u22121) and the expectation of products of independent random variables. The second moment can then be calculated as follows.\nEw\u223cp(w),x0\u223cp(D) [ (xil) 2 ] = Ew\u223cp(w) [ Ex0\u223cp(D) [ (xil) 2 ]]\n= Ew\u223cp(w) [ Varx0\u223cp(D) [ xil ] + 0 ]\n= Ew\u223cp(w) [ Varx0\u223cp(D) [Nl\u22121\u2211 j Wi,jl fl\u22121(x j l\u22121) + b i l ]]\n= Ew\u223cp(w) [ Nl\u22121\u2211 j ( Wi,jl )2 \u00b7Varx0\u223cp(D) [ fl\u22121(x j l\u22121) ]]\n= Nl\u22121\u2211 j Ew\u223cp(w) [( Wi,jl )2 \u00b7Varx0\u223cp(D) [ fl\u22121(x j l\u22121) ]]\n= Nl\u22121\u2211 j Ew\u223cp(w) [( Wi,jl )2] \u00b7 Ew\u223cp(w) [ Varx0\u223cp(D) [ fl\u22121(x j l\u22121) ]]\n=: Nl\u22121\u2211 j Ew\u223cp(w) [( Wi,jl )2] \u00b7 cfl\u22121 = cfl\u22121 \u00b7 Nl\u22121\u2211 j Varw\u223cp(w) [ Wi,jl\n] = Nl\u22121 \u00b7 cfl\u22121 \u00b7Varw\u223cp(w) [ Wi,jl ] .\nHere we introduced cfl\u22121 to denote a correction factor for the non-linearity fl\u22121. In case of the linear function, we will have cfl\u22121 = 1. For arbitrary non-linearities, we can estimate this factor numerically, assuming that the pre-activations are distributed according to N (0, 1).\ncfl\u22121 = Varxjl\u22121\u223cN (0,1)\n[ fl\u22121(x j l\u22121) ] This can be done beforehand and the factors for common activation functions can be stored in a lookup table. 
Finally, plugging in the constraint for the second moment in the above equation, we obtain the following prior variance.\n$(\\sigma_l^{i,j})^2 = \\mathrm{Var}_{w \\sim p(w)} \\big[ W_l^{i,j} \\big] = \\frac{1}{N_{l-1} \\, c_{f_{l-1}}}$ (17)\nI POSTERIOR INITIALISATION\nIt is known that a proper initialisation of standard neural networks is crucial for the optimisation process (Glorot & Bengio, 2010; He et al., 2015). In Bayesian neural networks, the matter becomes even more complicated, since we have to deal additionally with the variance of the Monte Carlo estimator due to re-parametrisation. Analogous to the choice of prior parameters, we seek a posterior initialisation that yields the first two moments of zero and one. A naive attempt would be to initialise the posterior with the prior parameters. However, the significant noise in the Monte Carlo estimation typically leads to bad optimisation results and even numerical instabilities. We propose an initialisation method which fulfills our constraints but allows us to determine the variance of the initial posterior with two hyperparameters $\\alpha$ and $\\beta$.\nLet us denote the mean and log-scale parameters of the approximate posterior as $\\theta = \\{\\theta_\\mu, \\theta_{\\log\\sigma}\\}$. We choose the following initialisation distributions.\n$q(w) = \\mathcal{N}(w; \\theta_\\mu, e^{2\\theta_{\\log\\sigma}})$, where\n$p(\\theta_\\mu) = \\mathcal{N}(\\theta_\\mu; \\mu_{\\theta_\\mu}, \\sigma_{\\theta_\\mu}^2)$,\nand $p(\\theta_{\\log\\sigma}) = \\mathcal{N}(\\theta_{\\log\\sigma}; \\mu_{\\theta_{\\log\\sigma}}, \\sigma_{\\theta_{\\log\\sigma}}^2)$.\nHere and in the following, we dropped the time index for the approximate posterior, as well as the indices $l$, $i$, and $j$ for the model parameters $\\theta$.\nWe follow a similar derivation as in Sec. H. 
As for the prior, the mean of the initialisation distribution will be zero for all parameters.\n\u00b5\u03b8\u00b5 = 0.\nFor the second moment, the derivation is as follows.\nE\u03b8\u223cp(\u03b8),w\u223cq(w|\u03b8),x0\u223cp(D) [ (xil) 2 ] = E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [ Ex0\u223cp(D) [( xil )2]]]\n= E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [ Varx0\u223cp(D) [Nl\u22121\u2211 j Wi,jl \u00b7 fl\u22121(x j l\u22121) ] + 0 ]]\n= E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [Nl\u22121\u2211 j (Wi,jl ) 2 ] Varx0\u223cp(D) [ xjl\u22121 ]]\n= E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [Nl\u22121\u2211 j (Wi,jl ) 2 ]] \u00b7 E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [ Varx0\u223cp(D) [ fl\u22121(x j l\u22121) ]]]\n= Nl\u22121\u2211 j E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [ (Wi,jl ) 2 ]] \u00b7 E\u03b8\u223cp(\u03b8) [ Ew\u223cq(w|\u03b8) [ Varx0\u223cp(D) [ fl\u22121(x j l\u22121) ]]]\n=: Nl\u22121\u2211 j E\u03b8\u223cp(\u03b8) [ \u03b82\u00b5 + e 2\u00b7\u03b8log \u03c3 ] \u00b7 cfl\u22121\n= Nl\u22121 \u00b7 cfl\u22121 \u00b7 E\u03b8\u223cp(\u03b8) [ \u03b82\u00b5 + e 2\u00b7\u03b8log \u03c3 ]\n= Nl\u22121 \u00b7 cfl\u22121 \u00b7 ( E\u03b8\u223cp(\u03b8) [ \u03b82\u00b5 ] + E\u03b8\u223cp(\u03b8) [ e2\u00b7\u03b8log \u03c3 ]) = Nl\u22121 \u00b7 cfl\u22121 \u00b7 ( E\u03b8\u223cp(\u03b8) [ \u03b8\u00b5 ]2 + Var\u03b8\u223cp(\u03b8) [ \u03b8\u00b5 ] + e2E[\u03b8log \u03c3]+2Var[\u03b8log \u03c3 ] )\n= Nl\u22121 \u00b7 cfl\u22121 \u00b7 ( \u00b52\u03b8\u00b5 + \u03c3 2 \u03b8\u00b5 + e 2\u00b5\u03b8log \u03c3+2\u03c3 2 \u03b8log \u03c3 ) = Nl\u22121 \u00b7 cfl\u22121 \u00b7 ( \u03c32\u03b8\u00b5 + e 2\u00b5\u03b8log \u03c3+2\u03c3 2 \u03b8log \u03c3 ) .\nHence, the second constraint is as follows.\n1\nNl\u22121 \u00b7 cfl\u22121 = \u03c32\u03b8\u00b5 + e\n2\u00b5\u03b8log \u03c3+2\u03c3 2 \u03b8log \u03c3 .\nIn contrast to Sec. H, we are now under-constrained by 2 parameters. We therefore introduce two hyperparameters \u03b1 and \u03b2. 
We first determine $\\alpha := \\sigma_{\\theta_{\\log\\sigma}}$, for which we generally choose small values $\\alpha \\approx 0$ ($\\alpha = 0$ corresponds to initialising all posterior variances in the given layer with the same value). The second hyperparameter $\\beta \\in [0, 1]$ determines how much of the total variance is due to the variance of the location parameter and how much is due to the variance of the log-scale parameter. Inserting $\\alpha$ and introducing $\\beta$ we obtain the following equations.\n$\\sigma_{\\theta_\\mu}^2 = \\frac{\\beta}{N_{l-1} \\cdot c_{f_{l-1}}}$,\nand $e^{2\\mu_{\\theta_{\\log\\sigma}} + 2\\alpha^2} = \\frac{1-\\beta}{N_{l-1} \\cdot c_{f_{l-1}}}$.\nSolving the last equation for $\\mu_{\\theta_{\\log\\sigma}}$, the result is as follows.\n$\\mu_{\\theta_{\\log\\sigma}} = \\frac{1}{2} \\log \\frac{(1-\\beta) \\cdot e^{-2\\alpha^2}}{N_{l-1} \\cdot c_{f_{l-1}}}$\nWe choose $\\alpha = 0.001$ and $\\beta = 0.999$ in all experiments.\nA note on the relation to initialisation methods for deterministic neural networks. Our result is similar to the initialisation methods from Glorot & Bengio (2010) and He et al. (2015). The difference is in the correction factor $c_{f_{l-1}}$. Whereas Glorot & Bengio (2010) consider linear functions (or tanh in the linear regime), both methods base their derivation on the assumption that every data sample $x_0$ is processed by a different, random neural network with independent weights, drawn from the initialisation distribution. The assumption is made explicit in He et al. (2015) by the use of the rule for the variance of products of independent variables. We note that this assumption holds neither for the initialisation of deterministic neural networks nor for the graphical model assumption in Bayesian neural networks. Consequently, He et al. (2015) obtain different correction factors (in their case for relu and leaky relu), taking into account the mean after the forward-pass through the non-linearity.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: On the expressive power of deep learning: A tensor analysis , Authors: Nadav Cohen Or Sharir... , 2016\n2.  Title: Boosting dilated convolutional networks with mixed tensor decompositions , Authors: Nadav Cohen Ronen Tamari... , 2018\n3.  Title: Hierarchical singular value decomposition of tensors , Authors: Lars Grasedyck , 2010\n4.  Title: Foundations of the PARAFAC procedure: Models and conditions for an \u201dexplanatory\u201d multimodal factor analysis , Authors: Richard A Harshman , 1970\n5.  Title: Expressive power of recurrent neural networks , Authors: Valentin Khrulkov Alexander Novikov... , 2018\n6.  Title: Speeding-up convolutional neural networks using fine-tuned cp-decomposition , Authors: Vadim Lebedev Yaroslav Ganin... , 2015\n7.  Title: Tensorizing neural networks , Authors: Alexander Novikov Dmitrii Podoprikhin... , 2015\n8.  Title: Tensor-train decomposition , Authors: Ivan V Oseledets , 2011\n9.  Title: On multiplicative integration with recurrent neural networks , Authors: Yuhuai Wu Saizheng Zhang... , 2016\n10.  Title: Tensor-train recurrent neural networks for video classification , Authors: Yinchong Yang Denis Krompass... , 2017\n11.  Title: Long-term forecasting using tensor-train RNNs , Authors: Rose Yu Stephan Zheng... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Recurrent Neural Networks (RNNs) are very successful at solving challenging problems with sequential data. However, this observed efficiency is not yet entirely explained by theory. It is known that a certain class of multiplicative RNNs enjoys the property of depth efficiency \u2014 a shallow network of exponentially large width is necessary to realize the same score function as computed by such an RNN. 
Such networks, however, are not very often applied to real life tasks. In this work, we attempt to reduce the gap between theory and practice by extending the theoretical analysis to RNNs which employ various nonlinearities, such as Rectified Linear Unit (ReLU), and show that they also benefit from properties of universality and depth efficiency. Our theoretical results are verified by a series of extensive computational experiments.", "Tensor methods have a rich history of successful application in machine learning. (Vasilescu & Terzopoulos, 2002), in their framework of TensorFaces, proposed to treat facial image data as multidimensional arrays and analyze them with tensor decompositions, which led to significant boost in face recognition accuracy. (Bailey & Aeron, 2017) employed higher-order co-occurence data and tensor factorization techniques to improve on word embeddings models. Tensor methods also allow to produce more accurate and robust recommender systems by taking into account a multifaceted nature of real environments (Frolov & Oseledets, 2017). In recent years a great deal of work was done in applications of tensor calculus to both theoretical and practical aspects of deep learning algorithms. (Lebedev et al., 2015) represented filters in a convolutional network with CP decomposition (Harshman, 1970; Carroll & Chang, 1970) which allowed for much faster inference at the cost of a negligible drop in performance. (Novikov et al., 2015) proposed to use Tensor Train (TT) decomposition (Oseledets, 2011) to compress fully\u2013connected layers of large neural networks while preserving their expressive power. Later on, TT was exploited to reduce the number of parameters and improve the performance of recurrent networks in long\u2013term forecasting (Yu et al., 2017) and video classification (Yang et al., 2017) problems. In addition to the practical benefits, tensor decompositions were used to analyze theoretical aspects of deep neural nets. 
(Cohen et al., 2016) investigated a connection between various network architectures and tensor decompositions, which made it possible to compare their expressive power. Specifically, it was shown that CP and Hierarchical Tucker (Grasedyck, 2010) decompositions correspond to shallow networks and convolutional networks respectively. Recently, this analysis was extended by (Khrulkov et al., 2018) who showed that TT decomposition can be represented as a recurrent network with multiplicative connections. This specific form of RNNs was also empirically proved to provide a substantial performance boost over standard RNN models (Wu et al., 2016). First results on the connection between tensor decompositions and neural networks were obtained for rather simple architectures; however, later on, they were extended in order to analyze more practical deep neural nets. It was shown that theoretical results can be generalized to a large class of CNNs with ReLU nonlinearities (Cohen & Shashua, 2016) and dilated convolutions (Cohen et al., 2018), providing valuable insights on how they can be improved.", "The authors extend the theoretical results of a paper previously presented in the last edition of ICLR (2018), where it was demonstrated that Recurrent Neural Network can be interpreted as a tensor network decomposition based on the Tensor-Train (TT, Oseledets et al, 2011).\n\nThe paper is well written and has a good structure. However, I found that some deep concepts are not well introduced, and maybe other more trivial results are discussed with unnecessary details. The following comments could help authors to improve the quality of presentation of their paper:\n-\tSection 3.1 (Score Functions and Feature Tensor) is a bit short and difficult to read. \no\tMaybe, a more motivating introduction could be included in order to justify the definition of score functions (eq. 2). \no\tIt would be also nice to state that, according to eq. (3), the feature tensor is a rank-1 tensor. 
\no\tI would suggest moving the definition of outer product to the Appendix, since most readers know it very well.\no\tIt is said that eq. 2 possesses the universal approximation property (it can approximate any function with any prescribed precision given sufficiently large M). It is not clear which is the approximation function.\n-       When scores functions are presented (eq. 2) it is written the term \"logits\" between brackets. Could you please clarify why this term is introduced here? Usually, logit of a probability p is defined as L(p)=p/(1-p). What is the usage of this term in this work? \n-      I think the theory is presented for a model with the two-classes only but used for multiple classes in the experimental sections. It should be necessary to make some comment about this in the paper.\n-      Details about how the RNN based on TT is applied must be added. More specifically, the authors should provide answers to clarify the following questions: \n(i) Are patches overlapped or non-overlapped? \n(ii) What value of M is used? and is there any general rule for this choice? \n(iii) How the classification in the 10-classes is obtained? Are you using a softmax function in the last layer? Are you using one weight tensor W_c per class (c=1,2,...,10). Please provide these technical details. \n(iv) Please, specify which nonlinear activation sigma is used in the feature map f_\\theta(x).\n(v) How many feature maps are used? and, Are the matrix A and vector b learned from training dataset or only the TT-cores need to be learned? \n7: Good paper, accept"], "input": "1 INTRODUCTION\nRecurrent Neural Networks are firmly established to be one of the best deep learning techniques when the task at hand requires processing sequential data, such as text, audio, or video (Graves et al., 2013; Mikolov et al., 2011; Gers et al., 1999). 
The ability of these neural networks to efficiently represent a rich class of functions with a relatively small number of parameters is often referred to as depth efficiency, and the theory behind this phenomenon is not yet fully understood. A recent line of work (Cohen & Shashua, 2016; Cohen et al., 2016; Khrulkov et al., 2018; Cohen et al., 2018) focuses on comparing various deep learning architectures in terms of their expressive power.\nIt was shown in (Cohen et al., 2016) that ConvNets with product pooling are exponentially more expressive than shallow networks, that is there exist functions realized by ConvNets which require an exponentially large number of parameters in order to be realized by shallow nets. A similar result also holds for RNNs with multiplicative recurrent cells (Khrulkov et al., 2018). We aim to extend this analysis to RNNs with rectifier nonlinearities which are often used in practice. The main challenge of such analysis is that the tools used for analyzing multiplicative networks, namely, properties of standard tensor decompositions and ideas from algebraic geometry, can not be applied in this case, and thus some other approach is required. Our objective is to apply the machinery of generalized tensor decompositions, and show universality and existence of depth efficiency in such RNNs. 3 ARCHITECTURES INSPIRED BY TENSOR DECOMPOSITIONS\nLet us now recall the known results about the connection of tensor decompositions and multiplicative architectures, and then show how they are generalized in order to include networks with ReLU nonlinearities. 3.1 SCORE FUNCTIONS AND FEATURE TENSOR\nSuppose that we are given a dataset of objects with a sequential structure, i.e. every object in the dataset can be written as\nX = ( x(1),x(2), . . . ,x(T ) ) , x(t) \u2208 RN . (1)\nWe also introduce a parametric feature map f\u03b8 : RN \u2192 RM which essentially preprocesses the data before it is fed into the network. 
Assumption 1 holds for many types of data, e.g. in the case of natural images we can cut them into rectangular patches which are then arranged into vectors x(t). A typical choice for the feature map f\u03b8 in this particular case is an affine map followed by a nonlinear activation: f\u03b8(x) = \u03c3(Ax + b). To draw the connection between tensor decompositions and feature tensors we consider the following score functions (logits1):\n`(X) = \u3008W ,\u03a6(X)\u3009 = (vecW)> vec \u03a6(X), (2) where W \u2208 RM\u00d7M\u00d7\u00b7\u00b7\u00b7\u00d7M is a trainable T\u2013way weight tensor and \u03a6(X) \u2208 RM\u00d7M\u00d7\u00b7\u00b7\u00b7\u00d7M is a rank 1 feature tensor, defined as\n\u03a6(X) = f\u03b8(x (1))\u2297 f\u03b8(x(2)) \u00b7 \u00b7 \u00b7 \u2297 f\u03b8(x(T )), (3)\nwhere we have used the operation of outer product \u2297, which is important in tensor calculus. For a tensor A of order N and a tensor B of order M their outer product C = A\u2297B is a tensor of order N +M defined as:\nCi1i2...iN j1j2...jM = Ai1i2\u00b7\u00b7\u00b7iNBj1j2\u00b7\u00b7\u00b7jM . (4) 1By logits we mean immediate outputs of the last hidden layer before applying nonlinearity. This term is adopted from classification tasks where neural network usually outputs logits and following softmax nonlinearity transforms them into valid probabilities.\nIt is known that equation 2 possesses the universal approximation property (it can approximate any function with any prescribed precision given sufficiently large M ) under mild assumptions on f\u03b8 (Cohen et al., 2016; Girosi & Poggio, 1990). 3.2 TENSOR DECOMPOSITIONS\nWorking the entire weight tensor W in eq. (2) is impractical for large M and T , since it requires exponential in T number of parameters. 
Thus, we compactly represent it using tensor decompositions, which will further lead to different neural network architectures, referred to as tensor networks (Cichocki et al., 2017).\nCP-decomposition The most basic decomposition is the so-called Canonical (CP) decomposition (Harshman, 1970; Carroll & Chang, 1970) which is defined as follows\nW = R\u2211 r=1 \u03bbrv (1) r \u2297 v(2)r \u00b7 \u00b7 \u00b7 \u2297 v(T )r , (5)\nwhere v(t)r \u2208 RM and minimal value of R such that decomposition equation 5 exists is called canonical rank of a tensor (CP\u2013rank). By substituting eq. (5) into eq. (2) we find that\n`(X) = R\u2211 r=1 \u03bbr [ \u3008f\u03b8(x(1)),v(1)r \u3009 \u2297 \u00b7 \u00b7 \u00b7 \u2297 \u3008f\u03b8(x(T )),v(T )r \u3009 ] = R\u2211 r=1 \u03bbr T\u220f t=1 \u3008f\u03b8(x(t)),v(t)r \u3009. (6)\nIn the equation above, outer products \u2297 are taken between scalars and coincide with the ordinary products between two numbers. However, we would like to keep this notation as it will come in handy later, when we generalize tensor decompositions to include various nonlinearities.\nTT-decomposition Another tensor decomposition is Tensor Train (TT) decomposition (Oseledets, 2011) which is defined as follows\nW = R1\u2211 r1=1 \u00b7 \u00b7 \u00b7 RT\u22121\u2211 rT\u22121=1 g(1)r0r1 \u2297 g (2) r1r2 \u2297 \u00b7 \u00b7 \u00b7 \u2297 g (T ) rT\u22121rT , (7)\nwhere g(t)rt\u22121rt \u2208 RM and r0 = rT = 1 by definition. If we gather vectors g (t) rt\u22121rt for all corresponding indices rt\u22121 \u2208 {1, . . . , Rt\u22121} and rt \u2208 {1, . . . , Rt} we will obtain three\u2013dimensional tensors G(t) \u2208 RM\u00d7Rt\u22121\u00d7Rt (for t = 1 and t = T we will get matrices G(1) \u2208 RM\u00d71\u00d7R1 and G(T ) \u2208 RM\u00d7RT\u22121\u00d71). The set of all such tensors {G(t)}Tt=1 is called TT\u2013cores and minimal values of {Rt}T\u22121t=1 such that decomposition equation 7 exists are called TT\u2013ranks. 
In the case of TT decomposition, the score function has the following form:\n`(X) = R1\u2211 r1=1 \u00b7 \u00b7 \u00b7 RT\u22121\u2211 rT\u22121=1 T\u220f t=1 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009. (8) 3.3 CONNECTION BETWEEN TT AND RNN\nNow we want to show that the score function for Tensor Train decomposition exhibits particular recurrent structure similar to that of RNN. We define the following hidden states:\nh(1) \u2208 RR1 : h(1)r1 = \u3008f\u03b8(x (1)),g(1)r0r1\u3009,\nh(t) \u2208 RRt : h(t)rt = Rt\u22121\u2211 rt\u22121=1 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009h (t\u22121) rt\u22121 t = 2, . . . , T.\n(9)\nSuch definition of hidden states allows for more compact form of the score function. Lemma 3.1. Under the notation introduced in eq. (9), the score function can be written as\n`(X) = h(T ) \u2208 R1.\nProof of Lemma 3.1 as well as the proofs of our main results from Section 5 were moved to Appendix A due to limited space.\nNote that with a help of TT\u2013cores we can rewrite eq. (9) in a more convenient index form:\nh (t) k = \u2211 i,j G(t)ijk f\u03b8(x (t))i h (t\u22121) j = \u2211 i,j G(t)ijk [ f\u03b8(x (t))\u2297 h(t\u22121) ] ij , k = 1, . . . , Rt, (10)\nwhere the operation of tensor contraction is used. Combining all weights from G(t) and f\u03b8(\u00b7) into a single variable \u0398(t)G and denoting the composition of feature map, outer product, and contraction as g : RRt\u22121 \u00d7 RN \u00d7 RN\u00d7Rt\u22121\u00d7Rt \u2192 RRt we arrive at the following vector form:\nh(t) = g(h(t\u22121),x(t); \u0398 (t) G ), h (t) \u2208 RRt . (11)\nThis equation can be considered as a generalization of hidden state equation for Recurrent Neural Networks as here all hidden states h(t) may in general have different dimensionalities and weight tensors \u0398(t)G depend on the time step. 
However, if we set R = R1 = \u00b7 \u00b7 \u00b7 = RT\u22121 and G = G (2) = \u00b7 \u00b7 \u00b7 = G(T\u22121) we will get simplified hidden state equation used in standard recurrent architectures:\nh(t) = g(h(t\u22121),x(t); \u0398G), h (t) \u2208 RR, t = 2, . . . , T \u2212 1. (12)\nNote that this equation is applicable to all hidden states except for the first h(1) = G(1)f\u03b8(x(1)) and for the last h(T ) = f>\u03b8 (x\n(T ))G(T )h(T\u22121), due to two\u2013dimensional nature of the corresponding TT\u2013cores. However, we can always pad the input sequence with two auxiliary vectors x(0) and x(T+1) to get full compliance with the standard RNN structure. Figure 1 depicts tensor network induced by TT decomposition with cores {G(t)}Tt=1. 4 GENERALIZED TENSOR NETWORKS\n 4.1 GENERALIZED OUTER PRODUCT\nIn the previous section we showed that tensor decompositions correspond to neural networks of specific structure, which are simplified versions of those used in practice as they contain multiplicative nonlinearities only. One possible way to introduce more practical nonlinearities is to replace outer product \u2297 in eq. (6) and eq. (10) with a generalized operator \u2297\u03be in analogy to kernel methods when scalar product is replaced by nonlinear kernel function. Let \u03be : R \u00d7 R \u2192 R be an associative and commutative binary operator (\u2200x, y, z \u2208 R : \u03be(\u03be(x, y), z) = \u03be(x, \u03be(y, z)) and \u2200x, y \u2208 R : \u03be(x, y) = \u03be(y, x)). Note that this operator easily generalizes to the arbitrary number of operands due to associativity. For a tensor A of order N and a tensor B of order M we define their generalized outer product C = A\u2297\u03be B as an (N +M) order tensor with entries given by:\nCi1...iN j1...jM = \u03be (Ai1...iN ,Bj1...jM ) . (13)\nNow we can replace \u2297 in eqs. (6) and (10) with \u2297\u03be and get networks with various nonlinearities. 
For example, if we take \u03be(x, y) = max(x, y, 0) we will get an RNN with rectifier nonlinearities; if we take \u03be(x, y) = ln(ex + ey) we will get an RNN with softplus nonlinearities; if we take \u03be(x, y) = xy we will get a simple RNN defined in the previous section. Concretely, we will analyze the following networks.\nGeneralized shallow network with \u03be\u2013nonlinearity\n\u2022 Score function:\n`(X) = R\u2211 r=1 \u03bbr [ \u3008f\u03b8(x(1)),v(1)r \u3009 \u2297\u03be \u00b7 \u00b7 \u00b7 \u2297\u03be \u3008f\u03b8(x(T )),v(T )r \u3009 ] =\nR\u2211 r=1 \u03bbr\u03be ( \u3008f\u03b8(x(1)),v(1)r \u3009, . . . , \u3008f\u03b8(x(T )),v(T )r \u3009 ) (14) \u2022 Parameters of the network:\n\u0398 = ( {\u03bbr}Rr=1 \u2208 R, {v(t)r } R,T r=1,t=1 \u2208 RM ) (15)\nGeneralized RNN with \u03be\u2013nonlinearity\n\u2022 Score function:\nh (t) k = \u2211 i,j G(t)ijk [ C(t)f\u03b8(x (t))\u2297\u03be h(t\u22121) ] ij = \u2211 i,j G(t)ijk \u03be ( [C(t)f\u03b8(x (t))]i,h (t\u22121) j ) `(X) = h(T )\n(16) \u2022 Parameters of the network:\n\u0398 = ( {C(t)}Tt=1 \u2208 RL\u00d7M , {G (t)}Tt=1 \u2208 RL\u00d7Rt\u22121\u00d7Rt )\n(17)\nNote that in eq. (16) we have introduced the matrices C(t) acting on the input states. The purpose of this modification is to obtain the plausible property of generalized shallow networks being able to be represented as generalized RNNs of width 1 (i.e., with all Ri = 1) for an arbitrary nonlinearity \u03be. In the case of \u03be(x, y) = xy, the matrices C(t) were not necessary, since they can be simply absorbed by G(t) via tensor contraction (see Appendix A for further clarification on these points).\nInitial hidden state Note that generalized RNNs require some choice of the initial hidden state h(0). We find that it is convenient both for theoretical analysis and in practice to initialize h(0) as unit of the operator \u03be, i.e. such an element u that \u03be(x, y, u) = \u03be(x, y) \u2200x, y \u2208 R. 
Henceforth, we will assume that such an element exists (e.g., for \u03be(x, y) = max(x, y, 0) we take u = 0, for \u03be(x, y) = xy we take u = 1), and set h(0) = u. For example, in eq. (9) it was implicitly assumed that h(0) = 1. 4.2 GRID TENSORS\nIntroduction of generalized outer product allows us to investigate RNNs with a wide class of nonlinear activation functions, especially ReLU. While this change looks appealing from the practical viewpoint, it complicates the subsequent theoretical analysis, as the transition from obtained networks back to tensors is not straightforward.\nIn the discussion above, every tensor network had a corresponding weight tensor W and we could compare expressivity of associated score functions by comparing some properties of these tensors, such as ranks (Khrulkov et al., 2018; Cohen et al., 2016). This method enabled comprehensive analysis of score functions, as it allows us to calculate and compare their values for all possible input sequences X = ( x(1), . . . ,x(T ) ) . Unfortunately, we can not apply it in case of generalized tensor networks, as the replacement of standard outer product \u2297 with its generalized version \u2297\u03be leads to the loss of conformity between tensor networks and weight tensors. Specifically, a generalized tensor network with score function `(X) need not admit a weight tensor W such that `(X) = \u3008W ,\u03a6(X)\u3009. Also, such properties as universality no longer hold automatically and we have to prove them separately. Indeed, as was noted in (Cohen & Shashua, 2016), shallow networks with \u03be(x, y) = max(x, 0) + max(y, 0) no longer have the universal approximation property. In order to conduct proper theoretical analysis, we adopt the apparatus of so-called grid tensors, first introduced in (Cohen & Shashua, 2016).\nGiven a set of fixed vectors X = { x(1), . . . 
,x(M) } referred to as templates, the grid tensor of X is defined to be the tensor of order T and dimension M in each mode, with entries given by:\n\u0393`(X)i1i2...iT = ` (X) , X = ( x(i1),x(i2), . . . ,x(iT ) ) , (18)\nwhere each index it can take values from {1, . . . ,M}, i.e. we evaluate the score function on every possible input assembled from the template vectors {x(i)}Mi=1. To put it simply, we previously considered the equality of score functions represented by tensor decomposition and tensor network on set of all possible input sequences X = ( x(1), . . . ,x(T ) ) , x(t) \u2208 RN , and now we restricted this set to exponentially large but finite grid of sequences consisting of template vectors only.\nDefine the matrix F \u2208 RM\u00d7M which holds the values taken by the representation function f\u03b8 : RN \u2192 RM on the selected templates X:\nF , [ f\u03b8(x (1)) f\u03b8(x (2)) . . . f\u03b8(x (M)) ]> . (19)\nUsing the matrix F we note that the grid tensor of generalized shallow network has the following form (see Appendix A for derivation):\n\u0393`(X) = R\u2211 r=1 \u03bbr ( Fv(1)r ) \u2297\u03be ( Fv(2)r ) \u2297\u03be \u00b7 \u00b7 \u00b7 \u2297\u03be ( Fv(T )r ) . (20)\nConstruction of the grid tensor for generalized RNN is a bit more involved. We find that its grid tensor \u0393`(X) can be computed recursively, similar to the hidden state in the case of a single input sequence. The exact formulas turned out to be rather cumbersome and we moved them to Appendix A. 5 MAIN RESULTS\nWith grid tensors at hand we are ready to compare the expressive power of generalized RNNs and generalized shallow networks. In the further analysis, we will assume that \u03be(x, y) = max(x, y, 0), i.e., we analyze RNNs and shallow networks with rectifier nonlinearity. However, we need to make two additional assumptions. 
First of all, similarly to (Cohen & Shashua, 2016) we fix some templates X such that values of the score function outside of the grid generated by X are irrelevant for classification and call them covering templates. It was argued that for image data values of M of order 100 are sufficient (corresponding covering template vectors may represent Gabor filters). Secondly, we assume that the feature matrix F is invertible, which is a reasonable assumption: in the case of f\u03b8(x) = \u03c3(Ax + b), for any distinct template vectors X the parameters A and b can be chosen in such a way that the matrix F is invertible. 5.1 UNIVERSALITY\nAs was discussed in section 4.2 we can no longer use standard algebraic techniques to verify universality of tensor based networks. Thus, our first result states that generalized RNNs with \u03be(x, y) = max(x, y, 0) are universal in the sense that any tensor of order T with every mode of size M can be realized as a grid tensor of such an RNN (and similarly of a generalized shallow network).\nTheorem 5.1 (Universality). Let H \u2208 RM\u00d7M\u00d7\u00b7\u00b7\u00b7\u00d7M be an arbitrary tensor of order T . Then there exist a generalized shallow network and a generalized RNN with rectifier nonlinearity \u03be(x, y) = max(x, y, 0) such that grid tensor of each of the networks coincides with H.\nPart of Theorem 5.1 which corresponds to generalized shallow networks readily follows from (Cohen & Shashua, 2016, Claim 4). In order to prove the statement for the RNNs the following two lemmas are used.\nLemma 5.1. Given two generalized RNNs with grid tensors \u0393`A(X), \u0393`B (X), and arbitrary \u03be-nonlinearity, there exists a generalized RNN with grid tensor \u0393`C (X) satisfying\n\u0393`C (X) = a\u0393`A(X) + b\u0393`B (X), \u2200a, b \u2208 R.\nThis lemma essentially states that the collection of grid tensors of generalized RNNs with any nonlinearity is closed under taking arbitrary linear combinations. 
Note that the same result clearly holds for generalized shallow networks because they are linear combinations of rank 1 shallow networks by definition.\nLemma 5.2. Let E(j1j2...jT ) be an arbitrary one\u2013hot tensor, defined as E(j1j2...jT )i1i2...iT = { 1, jt = it \u2200t \u2208 {1, . . . , T}, 0, otherwise.\nThen there exists a generalized RNN with rectifier nonlinearities such that its grid tensor satisfies\n\u0393`(X) = E(j1j2...jT ).\nThis lemma states that in the special case of rectifier nonlinearity \u03be(x, y) = max(x, y, 0) any basis tensor can be realized by some generalized RNN.\nProof of Theorem 5.1. By Lemma 5.2 for each one\u2013hot tensor E(i1i2...iT ) there exists a generalized RNN with rectifier nonlinearities, such that its grid tensor coincides with this tensor. Thus, by Lemma 5.1 we can construct an RNN with\n\u0393`(X) = \u2211 i1,i2,...,iT Hi1i2...iT E(i1i2...iT ) = H.\nFor generalized shallow networks with rectifier nonlinearities see the proof of (Cohen & Shashua, 2016, Claim 4).\nThe same result regarding networks with product nonlinearities considered in (Khrulkov et al., 2018) directly follows from the well\u2013known properties of tensor decompositions (see Appendix A).\nWe see that at least with such nonlinearities as \u03be(x, y) = max(x, y, 0) and \u03be(x, y) = xy all the networks under consideration are universal and can represent any possible grid tensor. Now let us head to a discussion of expressivity of these networks. 5.2 EXPRESSIVITY\nAs was discussed in the introduction, expressivity refers to the ability of some class of networks to represent the same functions as some other class much more compactly. In our case the parameters defining size of networks are ranks of the decomposition, i.e. in the case of generalized RNNs ranks determine the size of the hidden state, and in the case of generalized shallow networks rank determines the width of a network. 
It was proven in (Cohen et al., 2016; Khrulkov et al., 2018) that ConvNets and RNNs with multiplicative nonlinearities are exponentially more expressive than the equivalent shallow networks: shallow networks of exponentially large width are required to realize the same score functions as computed by these deep architectures. Similarly to the case of ConvNets (Cohen & Shashua, 2016), we find that expressivity of generalized RNNs with rectifier nonlinearity holds only partially, as discussed in the following two theorems. For simplicity, we assume that T is even. Theorem 5.2 (Expressivity 1). For every value of R there exists a generalized RNN with ranks \u2264 R and rectifier nonlinearity which is exponentially more efficient than shallow networks, i.e., the corresponding grid tensor may be realized only by a shallow network with rectifier nonlinearity of width at least (2/(MT)) \u00b7 min(M,R)^(T/2).\nThis result states that at least for some subset of generalized RNNs expressivity holds: exponentially wide shallow networks are required to realize the same grid tensor. Proof of the theorem is rather straightforward: we explicitly construct an example of such an RNN which satisfies the following description. Given an arbitrary input sequence X = ( x(1), . . . ,x(T ) ) assembled from the templates, these networks (if M = R) produce 0 if X has the property that x(1) = x(2), x(3) = x(4), . . . , x(T\u22121) = x(T ), and 1 in every other case, i.e. they measure pairwise similarity of the input vectors. A precise proof is given in Appendix A. In the case of multiplicative RNNs (Khrulkov et al., 2018) almost every network possessed this property. This is not the case, however, for generalized RNNs with rectifier nonlinearities.\nTheorem 5.3 (Expressivity 2). 
For every value of R there exists an open set (which thus has positive measure) of generalized RNNs with rectifier nonlinearity \u03be(x, y) = max(x, y, 0), such that for each RNN in this open set the corresponding grid tensor can be realized by a rank 1 shallow network with rectifier nonlinearity.\nIn other words, for every rank R we can find a set of generalized RNNs of positive measure such that the property of expressivity does not hold. In the numerical experiments in Section 6 and Appendix A we validate whether this can be observed in practice, and find that the probability of obtaining CP\u2013ranks of polynomial size becomes negligible with large T and R. Proof of Theorem 5.3 is provided in Appendix A.\nShared case Note that all the RNNs used in practice have shared weights, which allows them to process sequences of arbitrary length. So far in the analysis we have not made such assumptions about RNNs (i.e., G(2) = \u00b7 \u00b7 \u00b7 = G(T\u22121)). By imposing this constraint, we lose the property of universality; however, we believe that the statements of Theorems 5.2 and 5.3 still hold (without requiring that shallow networks also have shared weights). Note that the example constructed in the proof of Theorem 5.3 already has this property, and for Theorem 5.2 we provide numerical evidence in Appendix A. 6 EXPERIMENTS\nIn this section, we study if our theoretical findings are supported by experimental data. In particular, we investigate whether generalized tensor networks can be used in practical settings, especially in problems typically solved by RNNs (such as natural language processing problems). Secondly, according to Theorem 5.3 for some subset of RNNs the equivalent shallow network may have a low rank. 
To get a grasp of how strong this effect might be in practice we numerically compute an estimate for this rank in various settings.\nPerformance For the first experiment, we use two computer vision datasets MNIST (LeCun et al., 1990) and CIFAR\u201310 (Krizhevsky & Hinton, 2009), and a natural language processing dataset for sentiment analysis IMDB (Maas et al., 2011). For the first two datasets, we cut natural images into rectangular patches which are then arranged into vectors x(t) (similar to (Khrulkov et al., 2018)) and for IMDB dataset the input data already has the desired sequential structure.\nFigure 2 depicts test accuracy on IMDB dataset for generalized shallow networks and RNNs with rectifier nonlinearity. We see that a generalized shallow network of much higher rank is required to get the level of performance close to that achievable by a generalized RNN. Due to limited space, we have moved the results of the experiments on the visual datasets to Appendix B.\nExpressivity For the second experiment we generate a number of generalized RNNs with different values of TT-rank R and calculate a lower bound on the rank of shallow network necessary to realize the same grid tensor (to estimate the rank we use the same technique as in the proof of Theorem 5.2). Figure 3 shows that for different values of R and generalized RNNs of the corresponding rank there exist shallow networks of rank 1 realizing the same grid tensor, which agrees well with Theorem 5.3. This result looks discouraging; however, there is also a positive observation. As the rank of generalized RNNs increases, more and more of the corresponding shallow networks will necessarily have exponentially higher rank. In practice we usually deal with RNNs of R = 10^2 \u2212 10^3 (dimension of hidden states), thus we may expect that effectively any function besides a negligible set realized by generalized RNNs can be implemented only by exponentially wider shallow networks. 
The numerical results for the case of shared cores and other nonlinearities are given in Appendix B. 7 CONCLUSION\nIn this paper, we sought a more complete picture of the connection between Recurrent Neural Networks and Tensor Train decomposition, one that involves various nonlinearities applied to hidden states. We showed how these nonlinearities could be incorporated into network architectures and provided a complete theoretical analysis of the particular case of the rectifier nonlinearity, elaborating on points of generality and expressive power. We believe our results will be useful for advancing the theoretical understanding of RNNs. In future work, we would like to extend the theoretical analysis to the architectures that are most competitive in practice for processing sequential data, such as LSTMs and attention mechanisms. ACKNOWLEDGEMENTS\nWe would like to thank Andrzej Cichocki for constructive discussions during the preparation of the manuscript and anonymous reviewers for their valuable feedback. This work was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001). A PROOFS\nLemma 3.1. Under the notation introduced in eq. 
(9), the score function can be written as\n`(X) = h(T ) \u2208 R1.\nProof.\nl(X) = R1\u2211 r1=1 \u00b7 \u00b7 \u00b7 RT\u22121\u2211 rT\u22121=1 T\u220f t=1 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009\n= R1\u2211 r1=1 \u00b7 \u00b7 \u00b7 RT\u22121\u2211 rT\u22121=1 T\u220f t=2\n\u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009 \u3008f\u03b8(x (1)),g(1)r0r1\u3009\ufe38 \ufe37\ufe37 \ufe38 h\n(1) r1\n= RT\u22121\u2211 rT\u22121=1 \u00b7 \u00b7 \u00b7 R1\u2211 r1=1 T\u220f t=2 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009h (1) r1\n= RT\u22121\u2211 rT\u22121=1 \u00b7 \u00b7 \u00b7 R2\u2211 r2=1 T\u220f t=3 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009 r1\u2211 r1=1\n\u3008f\u03b8(x(2)),g(2)r1r2\u3009h (1) r1\ufe38 \ufe37\ufe37 \ufe38\nh (2) r2\n= RT\u22121\u2211 rT\u22121=1 \u00b7 \u00b7 \u00b7 R2\u2211 r2=1 T\u220f t=3 \u3008f\u03b8(x(t)),g(t)rt\u22121rt\u3009h (2) r2\n= . . .\n= RT\u22121\u2211 rT\u22121=1 \u3008f\u03b8(x(T )),g(T )rT\u22121rT \u3009h (T\u22121) rT\u22121 = h (T ) rT = h (T ).\nProposition A.1. If we replace the generalized outer product \u2297\u03be in eq. (16) with the standard outer product \u2297, we can subsume matrices C(t) into tensors G(t) without loss of generality.\nProof. Let us rewrite hidden state equation eq. (16) after transition from \u2297\u03be to \u2297:\nh (t) k = \u2211 i,j G(t)ijk [ C(t)f\u03b8(x (t))\u2297 h(t\u22121) ] ij\n= \u2211 i,j G(t)ijk \u2211 l C (t) il f\u03b8(x (t))lh (t\u22121) j\n{ G\u0303(t)ljk = \u2211 i G(t)ijkC (t) il }\n= \u2211 l,j G\u0303(t)ljkf\u03b8(x(t))lh (t\u22121) j\n= \u2211 l,j G\u0303(t)ljk [ f\u03b8(x (t))\u2297 h(t\u22121) ] lj .\nWe see that the obtained expression resembles those presented in eq. (10) with TT-cores G(t) replaced by G\u0303(t) and thus all the reasoning applied in the absence of matrices C(t) holds valid.\nProposition A.2. Grid tensor of generalized shallow network has the following form (eq. 
(20)):\n\u0393`(X) = R\u2211 r=1 \u03bbr ( Fv(1)r ) \u2297\u03be ( Fv(2)r ) \u2297\u03be \u00b7 \u00b7 \u00b7 \u2297\u03be ( Fv(T )r ) .\nProof. Let X = ( x(i1),x(i2), . . . ,x(iT ) ) denote an arbitrary sequence of templates. Corresponding element of the grid tensor defined in eq. (20) has the following form:\n\u0393`(X)i1i2...iT = R\u2211 r=1 \u03bbr [( Fv(1)r ) \u2297\u03be ( Fv(2)r ) \u2297\u03be \u00b7 \u00b7 \u00b7 \u2297\u03be ( Fv(T )r )] i1i2...iT\n= R\u2211 r=1 \u03bbr ( Fv(1)r ) i1 \u2297\u03be ( Fv(2)r ) i2 \u2297\u03be \u00b7 \u00b7 \u00b7 \u2297\u03be ( Fv(T )r ) iT\n= R\u2211 r=1 \u03bbr\u03be ( \u3008f\u03b8(x(i1)),v(1)r \u3009, . . . , \u3008f\u03b8(x(iT )),v(T )r \u3009 ) = `(X).\nProposition A.3. Grid tensor of a generalized RNN has the following form:\n\u0393`,0(X) = h(0) \u2208 R1, \u0393`,1(X)km1 = \u2211 i,j G(1)ijk ( C(1)F> \u2297\u03be \u0393`,0 ) im1j \u2208 RR1\u00d7M ,\n\u0393`,2(X)km1m2 = \u2211 i,j G(2)ijk ( C(2)F> \u2297\u03be \u0393`,1 ) im2jm1 \u2208 RR2\u00d7M\u00d7M ,\n\u00b7 \u00b7 \u00b7 \u0393`,T (X)km1m2...mT = \u2211 i,j G(T )ijk ( C(T )F> \u2297\u03be \u0393`,T\u22121 ) imT jm1...mT\u22121 \u2208 R1\u00d7M\u00d7M\u00d7\u00b7\u00b7\u00b7\u00d7M ,\n\u0393`(X) = \u0393`,T (X)1,:,:,...,:\n(21)\nProof. Proof is similar to that of Proposition A.2 and uses eq. (16) to compute the elements of the grid tensor.\nLemma 5.1. Given two generalized RNNs with grid tensors \u0393`A(X), \u0393`B (X), and arbitrary \u03be-nonlinearity, there exists a generalized RNN with grid tensor \u0393`C (X) satisfying\n\u0393`C (X) = a\u0393`A(X) + b\u0393`B (X), \u2200a, b \u2208 R.\nProof. 
Let these RNNs be defined by the weight parameters\n\u0398A = ( {C(t)A } T t=1 \u2208 RLA\u00d7M , {G (t) A } T t=1 \u2208 RLA\u00d7Rt\u22121,A\u00d7Rt,A ) ,\nand\n\u0398B = ( {C(t)B } T t=1 \u2208 RLB\u00d7M , {G (t) B } T t=1 \u2208 RLB\u00d7Rt\u22121,B\u00d7Rt,B ) .\nWe claim that the desired grid tensor is given by the RNN with the following weight settings.\nC (t) C \u2208 R (LA+LB)\u00d7M\nC (t) C =\n[ C\n(t) A C (t) B ] G(1)C \u2208 R (LA+LB)\u00d71\u00d7(Rt,A+Rt,B)\n[G(1)C ]i,:,: =  [ [G(1)A ]i,:,: 0 ] , i \u2208 {1, . . . , LA}[\n0 [G(1)B ](i\u2212LA),:,: ] , i \u2208 {LA + 1, . . . , LA + LB}\nG(t)C \u2208 R (LA+LB)\u00d7(Rt\u22121,A+Rt\u22121,B)\u00d7(Rt,A+Rt,B), 1 < t < T\n[G(t)C ]i,:,: =  [ [G(t)A ]i,:,: 0 0 0 ] , i \u2208 {1, . . . , LA} [ 0 0\n0 [G(t)B ](i\u2212LA),:,:\n] , i \u2208 {LA + 1, . . . , LA + LB}\nG(T )C \u2208 R (LA+LB)\u00d7(Rt\u22121,A+Rt\u22121,B)\u00d71\n[G(T )C ]i,:,: =  [ a[G(T )A ]i,:,: 0 ] , i \u2208 {1, . . . , LA} [ 0\nb[G(T )B ](i\u2212LA),:,:\n] , i \u2208 {LA + 1, . . . , LA + LB}.\nIt is straightforward to verify that the network defined by these weights possesses the following property:\nh (t) C =\n[ h (t) A\nh (t) B\n] , 0 < t < T,\nand\nh (T ) C = ah (T ) A + bh (T ) B ,\nconcluding the proof. We also note that these formulas generalize the well\u2013known formulas for addition of two tensors in the Tensor Train format (Oseledets, 2011).\nProposition A.4. For any associative and commutative binary operator \u03be, an arbitrary generalized rank 1 shallow network with \u03be\u2013nonlinearity can be represented in a form of generalized RNN with unit ranks (R1 = \u00b7 \u00b7 \u00b7 = RT\u22121 = 1) and \u03be\u2013nonlinearity.\nProof. Let \u0398 = ( \u03bb, {v(t)}Tt=1 ) be the parameters specifying the given generalized shallow network. 
Then the following weight settings provide the equivalent generalized RNN (with h(0) being the unity of the operator \u03be).\nC(t) = ( v(t) )> \u2208 R1\u00d7M ,\nG(t) = 1, t < T, G(T ) = \u03bb.\nIndeed, in the notation defined above, hidden states of generalized RNN have the following form:\nh(t) = G(t)\u03be ( [C(t)f\u03b8(x (t))],h(t\u22121) ) = \u03be ( \u3008f\u03b8(x(t)),v(t)\u3009,h(t\u22121) ) , t = 1, . . . , T \u2212 1\nh(T ) = \u03bb\u03be ( \u3008f\u03b8(x(T )),v(T )\u3009,h(T\u22121) ) .\nThe score function of generalized RNN is given by eq. (16):\n`(X) = h(T ) = \u03bb\u03be ( \u3008f\u03b8(x(T )),v(T )\u3009,h(T\u22121) ) = \u03bb\u03be ( \u3008f\u03b8(x(T )),v(T )\u3009, \u3008f\u03b8(x(T\u22121)),v(T\u22121)\u3009,h(T\u22122)\n) . . .\n= \u03bb\u03be ( \u3008f\u03b8(x(T )),v(T )\u3009, . . . , \u3008f\u03b8(x(1)),v(1)\u3009 ) ,\nwhich coincides with the score function of rank 1 shallow network defined by parameters \u0398.\nLemma 5.2. Let E(j1j2...jT ) be an arbitrary one\u2013hot tensor, defined as E(j1j2...jT )i1i2...iT = { 1, jt = it \u2200t \u2208 {1, . . . , T}, 0, otherwise.\nThen there exists a generalized RNN with rectifier nonlinearities such that its grid tensor satisfies\n\u0393`(X) = E(j1j2...jT ).\nProof. It is known that the statement of the lemma holds for generalized shallow networks with rectifier nonlinearities (see (Cohen & Shashua, 2016, Claim 4)). Based on Proposition A.4 and Lemma 5.1 we can conclude that it also holds for generalized RNNs with rectifier nonlinearities.\nProposition A.5. Statement of Theorem 5.1 holds with \u03be(x, y) = xy.\nProof. By assumption the matrix F is invertible. Consider the following tensor H\u0302 : H\u0302i1i2...iT = \u2211\nj1,...,jT\nHj1,...,jT F\u22121j1i1 . . .F \u22121 jT iT ,\nand the score function in the form of eq. (2):\n`(X) = \u3008H\u0302,\u03a6(X)\u3009.\nNote that by construction for any input assembled from the template vectors we obtain ` ( (x(i1), . . . ,x(iT )) ) = Hi1...iT . 
By taking the standard TT and CP decompositions of H\u0302 which always exist (Oseledets, 2011; Kolda & Bader, 2009), and using Lemma 3.1 and eq. (6) we conclude that universality holds.\nTheorem 5.2 (Expressivity 1). For every value of R there exists a generalized RNN with ranks \u2264 R and rectifier nonlinearity which is exponentially more efficient than shallow networks, i.e., the corresponding grid tensor may be realized only by a shallow network with rectifier nonlinearity of width at least (2/(MT)) min(M,R)^{T/2}.\nIn order to prove the theorem we will use the standard technique of matricizations. Simply put, by matricizing a tensor we reshape it into a matrix by splitting the indices of a tensor into two collections, and converting each one of them into one long index. I.e., for a tensor A of order T with mode sizes being M, we split the set {1, . . . , T} into two non\u2013overlapping ordered subsets s and t, and define the matricization A^{(s,t)} \u2208 R^{M^{|s|} \u00d7 M^{|t|}} by simply reshaping (and possibly transposing) the tensor A according to s and t. We will consider the matricization obtained by taking s_odd = (1, 3, . . . , T \u2212 1), t_even = (2, 4, . . . , T), i.e., we split out even and odd modes. A typical application of matricization is the following: suppose that we can upper and lower bound the ordinary matrix rank of a certain matricization using the parameters specifying each of the architectures being analyzed. Then under the assumption that both architectures realize the same grid tensor (and thus ranks of the matricization coincide) we can compare the sizes of corresponding architectures. In the case of generalized shallow networks with rectifier nonlinearity we will use the following result (Cohen & Shashua, 2016, Claim 9).\nLemma A.1. Let \u0393`(X) be a grid tensor generated by a generalized shallow network of rank R and \u03be(x, y) = max(x, y, 0). 
Then\nrank [ \u0393`(X) ]^{(s_odd, t_even)} \u2264 R \u00b7 TM/2,\nwhere the ordinary matrix rank is assumed.\nThis result is a generalization of a well\u2013known property of the standard CP-decomposition (i.e. if \u03be(x, y) = xy), which states that for a rank R decomposition, the matrix rank of every matricization is bounded by R.\nIn order to prove Theorem 5.2 we will construct an example of a generalized RNN with exponentially large matrix rank of the matricization of grid tensor, from which, together with Lemma A.1, the statement of the theorem will follow.\nLemma A.2. Without loss of generality assume that x_i = e_i (which can be achieved since F is invertible). Let 1^{(p,q)} denote the matrix of size p\u00d7q with each entry being 1, I^{(p,q)} denote the matrix of size p\u00d7q with I^{(p,q)}_{ij} = \u03b4_{ij} (\u03b4 being the Kronecker symbol), and b = [1 \u2212 min(M,R), 0^\u22a4_{R\u22121}] \u2208 R^{1\u00d7R}. Consider the following weight setting for a generalized RNN with \u03be(x, y) = max(x, y, 0).\nC^{(t)} = 1^{(M,M)} \u2212 I^{(M,M)} for t odd, and C^{(t)} = 1^{(M+1,M)} \u2212 I^{(M+1,M)} for t even.\nG^{(t)} = I^{(M,R)} \u2208 R^{M\u00d71\u00d7R} for t odd, and G^{(t)} = [ I^{(M,R)} ; b ] \u2208 R^{(M+1)\u00d7R\u00d71} for t even (the row b is stacked below I^{(M,R)}).\nThen grid tensor \u0393`(X) of this RNN satisfies\nrank [ \u0393`(X) ]^{(s_odd, t_even)} \u2265 min(M,R)^{T/2}, where the ordinary matrix rank is assumed.\nProof. An informal description of the network defined by the weights in the statement of the lemma is as follows. Given some input vector e_i, it is first transformed into its bitwise negative \u0113_i, and its first R components are saved into the hidden state. The next block then measures whether the first min(R,M) components of the current input coincide with the hidden state (after again taking bitwise negative). If this is the case, the hidden state is set to 0 and the process continues. Otherwise, the hidden state is set to 1 which then flows to the output independently of the other inputs. In other words, for all the inputs of the form X = (x_{i_1}, x_{i_1}, . . . , x_{i_{T/2}}, x_{i_{T/2}}) with i_1 \u2264 R, . . . 
, i_{T/2} \u2264 R we obtain that `(X) = 0, and in every other case `(X) = 1. Thus, we obtain that [ \u0393`(X) ]^{(s_odd, t_even)} is a matrix with all the entries equal to 1, except for min(M,R)^{T/2} entries on the diagonal, which are equal to 0. The rank of such a matrix is R^{T/2} + 1 if R < M and M^{T/2} otherwise, and the statement of the lemma follows.\nBased on these two lemmas we immediately obtain Theorem 5.2.\nProof of Theorem 5.2. Consider the example constructed in the proof of Lemma A.2. By Lemma A.1 the rank of the shallow network with rectifier nonlinearity which is able to represent the same grid tensor is at least (2/(TM)) min(M,R)^{T/2}.\nTheorem 5.3 (Expressivity 2). For every value of R there exists an open set (which thus has positive measure) of generalized RNNs with rectifier nonlinearity \u03be(x, y) = max(x, y, 0), such that for each RNN in this open set the corresponding grid tensor can be realized by a rank 1 shallow network with rectifier nonlinearity.\nProof. As before, let us denote by I^{(p,q)} a matrix of size p \u00d7 q such that I^{(p,q)}_{ij} = \u03b4_{ij}, and by a^{(p_1,p_2,...,p_d)} we denote a tensor of size p_1 \u00d7 \u00b7 \u00b7 \u00b7 \u00d7 p_d with each entry being a (sometimes we will omit the dimensions when they can be inferred from the context). Consider the following weight settings for a generalized RNN.\nC^{(t)} = (F^\u22a4)^{\u22121},\nG^{(t)} = 2^{(M,1,R)} for t = 1, G^{(t)} = 1^{(M,R,R)} for t = 2, . . . , T \u2212 1, and G^{(t)} = 1^{(M,R,1)} for t = T.\nThe RNN defined by these weights has the property that \u0393`(X) is a constant tensor with each entry being 2(MR)^{T\u22121}, which can be trivially represented by a rank 1 generalized shallow network. We will show that this property holds under a small perturbation of C^{(t)}, G^{(t)} and F. Let us denote each of these perturbations (and every tensor appearing below whose size can be assumed arbitrarily small) collectively by \u03b5. Applying eq. 
(21) we obtain (with \u03be(x, y) = max(x, y, 0)).\n\u0393`,0(X) = 0 \u2208 R1, \u0393`,1(X)km1 = \u2211 i,j G(1)ijk ( (I(M,M) + \u03b5)\u2297\u03be 0 ) im1j = 1\u2297 (2 + \u03b5),\n\u0393`,2(X)km1m2 = \u2211 i,j G(2)ijk ( (I(M,M) + \u03b5)\u2297\u03be \u0393`,1(X) ) im2jm1 = 1\u2297 (2MR + \u03b5)\u2297 1,\n\u00b7 \u00b7 \u00b7 \u0393`,T (X)km1m2...mT = 1\u2297 (2(MR)T\u22121 + \u03b5)\u2297 1 \u00b7 \u00b7 \u00b7 \u2297 1, \u0393`(X) = \u0393`,T (X)1,:,:,...,: = (2(MR)T\u22121 + \u03b5)\u2297 1 \u00b7 \u00b7 \u00b7 \u2297 1,\nwhere we have used a simple property connecting \u2297\u03be with \u03be(x, y) = max(x, y, 0) and ordinary \u2297: if for tensors A and B each entry of A is greater than each entry of B, A\u2297\u03beB = A\u22971. The obtained grid tensors can be represented using rank 1 generalized shallow networks with the following weight settings.\n\u03bb = 1,\nvt =\n{ F\u22121\u03b5 (2(MR) T\u22121 + \u03b5), t = 1,\n0, t > 1,\nwhere F\u03b5 is the feature matrix of the corresponding perturbed network. B ADDITIONAL EXPERIMENTS\nIn this section we provide the results additional computational experiments, aimed to provide more thorough and complete analysis of generalized RNNs.\nDifferent \u03be-nonlinearities In this paper we presented theoretical analysis of rectifier nonlinearity which corresponds to \u03be(x, y) = max(x, y, 0). However, there is a number of other associative binary operators \u03be which can be incorporated in generalized tensor networks. Strictly speaking, every one of them has to be carefully explored theoretically in order to speak about their generality and expressive power, but for now we can compare them empirically.\nTable 1 shows the performance (accuracy on test data) of different nonlinearities on MNIST, CIFAR\u201410, and IMDB datasets for classification. Although these problems are not considered hard to solve, we see that the right choice of nonlinearity can lead to a significant boost in performance. 
For the experiments on the visual datasets we used T = 16, m = 32, R = 64, and for the experiments on the IMDB dataset we had T = 100, m = 50, R = 50. Parameters of all networks were optimized using Adam (learning rate \u03b1 = 10^{\u22124}) and batch size 250.\nExpressivity in the case of shared cores We repeat the expressivity experiments from Section 6 in the case of equal TT\u2013cores (G(2) = \u00b7 \u00b7 \u00b7 = G(T\u22121)). We observe that, similar to the case of different cores, there always exist rank 1 generalized shallow networks which realize the same score function as a generalized RNN of higher rank; however, this situation appears very unlikely for large values of R.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Incremental decoding and training methods for simultaneous translation in neural machine translation , Authors: Fahim Dalvi Nadir Durrani... , 2018\n2.  Title: Don\u2019t until the final verb wait: Reinforcement learning for simultaneous machine translation , Authors: Alvin Grissom II He He... , 2014\n3.  Title: Learning to translate in real-time with neural machine translation , Authors: Jiatao Gu Graham Neubig... , 2017\n4.  Title: Learning hard alignments with variational inference , Authors: Dieterich Lawson Chung-Cheng Chiu... , 2018\n5.  Title: Learning online alignments with continuous rewards policy gradient , Authors: Yuping Luo Chung-Cheng Chiu... , 2017\n6.  Title: STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework , Authors: Mingbo Ma Liang Huang... , 2019\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Simultaneous machine translation models start generating a target sequence before they have encoded the source sequence. Recent approaches for this task either apply a fixed policy on a state-of-the art Transformer model, or a learnable monotonic attention on a weaker recurrent neural network-based structure. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention. We also introduce two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads. We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. 
We analyze how the latency controls affect the attention span and we study the relationship between the speed of a head and the layer it belongs to. Finally, we motivate the introduction of our model by analyzing the effect of the number of decoder layers and heads on quality and latency.", "Recent work on simultaneous machine translation falls into three categories. In the first one, models use a rule-based policy for reading input and writing output. Cho & Esipova (2016) propose a WaitIf-* policy to enable an offline model to decode simultaneously. Ma et al. (2019) propose a wait-k policy where the model first reads k tokens, then alternates between read and write actions. Dalvi et al. (2018) propose an incremental decoding method, also based on a rule-based schedule. In the second category, a flexible policy is learnt from data. Grissom II et al. (2014) introduce a Markov chain to phrase-based machine translation models for simultaneous machine translation, in which they apply reinforcement learning to learn the read-write policy based on states. Gu et al. (2017) introduce an agent which learns to make decisions on when to translate from the interaction with a pre-trained offline neural machine translation model. Luo et al. (2017) used continuous rewards policy gradient for online alignments for speech recognition. Lawson et al.", "This paper proposes a fully transformer-based monotonic attention framework that extends the idea of MILK. Though the idea of monotonic multi-head attention sounds interesting, I still have some questions below:\n\nAbout the method:\n   1. Is that possible that the MMA would have worse latency than MILK since all the attention heads need to agree to write while MILK only has one attention head?\n   2. Is there any attention order between different attention head?\n   3. I think the MMA only could control the latency during training time, which would produce different models with different latency. 
Is there any way that enables MMA to control the latency during inference time? Can we change the latency for on given model by tuning the requirements mentioned in (1)?\n\nAbout the experiments:\n    1. Do you have any explanation of why both MMA-H and MMA-IL have better BLEU when AL is small? The results in fig 2 seem counterintuitive. \n    2. I suggest the authors do more analysis of the difference between different attention heads to prove the effectiveness of MMA. \n    3. For the left two figures in fig 4, which one is the baseline, and which one is the proposed model?\n\nI also suggest the authors present more real sample analysis and discussions about the experiments.\n6: Weak Accept"], "input": "1 INTRODUCTION\nSimultaneous machine translation adds the capability of a live interpreter to machine translation: a simultaneous model starts generating a translation before it has finished reading the entire source sentence. Such models are useful in any situation where translation needs to be done in real time. For example, simultaneous models can translate live video captions or facilitate conversations between people speaking different languages. In a usual translation model, the encoder first reads the entire sentence, then the decoder writes the target sentence. On the other hand, a simultaneous neural machine translation model alternates between reading the input and writing the output using either a fixed or learned policy.\nMonotonic attention mechanisms fall into the flexible policy category, in which the policies are automatically learned from data. Recent work exploring monotonic attention variants for simultaneous translation include: hard monotonic attention (Raffel et al., 2017), monotonic chunkwise attention (MoChA) (Chiu & Raffel, 2018) and monotonic infinite lookback attention (MILk) (Arivazhagan et al., 2019). 
MILk in particular has shown better quality/latency trade-offs than fixed policy approaches, such as wait-k (Ma et al., 2019) or wait-if-* (Cho & Esipova, 2016) policies. MILk also outperforms hard monotonic attention and MoChA; while the other two monotonic attention mechanisms only consider a fixed window, MILk computes a softmax attention over all previous encoder states, which may be the key to its improved latency-quality tradeoffs. These monotonic attention approaches also provide a closed-form expression for the expected alignment between source and target tokens.\nHowever, monotonic attention-based models, including the state-of-the-art MILk, were built on top of RNN-based models. RNN-based models have been outperformed by the recent state-of-the-art Transformer model (Vaswani et al., 2017), which features multiple encoder-decoder attention layers and multihead attention at each layer.\n\u2217Work conducted during an internship at Facebook 1The code is available at https://github.com/pytorch/fairseq/tree/master/\nexamples/simultaneous_translation\nWe thus propose monotonic multihead attention (MMA), which combines the high translation quality from multilayer multihead attention and low latency from monotonic attention. We propose two variants, Hard MMA (MMA-H) and Infinite Lookback MMA (MMA-IL). MMA-H is designed with streaming systems in mind where the attention span must be limited. MMA-IL emphasizes the quality of the translation system. We also propose two novel latency regularization methods. The first encourages the model to be faster by directly minimizing the average latency. The second encourages the attention heads to maintain similar positions, preventing the latency from being dominated by a single or a few heads.\nThe main contributions of this paper are: (1) A novel monotonic attention mechanism, monotonic multihead attention, which enables the Transformer model to perform online decoding. 
This model leverages the power of the Transformer and the efficiency of monotonic attention. (2) Better latency/quality tradeoffs compared to the MILk model, the previous state-of-the-art, on two standard translation benchmarks, IWSLT15 English-Vietnamese (En-Vi) and WMT15 German-English (DeEn). (3) Analyses on how our model is able to control the attention span and on the relationship between the speed of a head and the layer it belongs to. We motivate the design of our model with an ablation study on the number of decoder layers and the number of decoder heads. 2 MONOTONIC MULTIHEAD ATTENTION MODEL\nIn this section, we review the monotonic attention-based approaches in RNN-based encoder-decoder models. We then introduce the two types of Monotonic Multihead Attention (MMA) for Transformer models: MMA-H and MMA-IL. Finally, we introduce strategies to control latency and coverage. 2.1 MONOTONIC ATTENTION\nThe hard monotonic attention mechanism (Raffel et al., 2017) was first introduced in order to achieve online linear time decoding for RNN-based encoder-decoder models. We denote the input sequence as x = {x1, ..., xT }, and the corresponding encoder states as m = {m1, ...,mT }, with T being the length of the source sequence. The model generates a target sequence y = {y1, ..., yU} with U being the length of the target sequence. At the i-th decoding step, the decoder only attends to one encoder state mti with ti = j. When generating a new target token yi, the decoder chooses whether to move one step forward or to stay at the current position based on a Bernoulli selection probability pi,j , so that ti \u2265 ti\u22121. 
Denoting the decoder state at the i-th position, starting from j = ti\u22121, ti\u22121 + 1, ti\u22121 + 2, ..., this process can be calculated as follows: 2\nei,j = MonotonicEnergy(si\u22121,mj) (1) pi,j = Sigmoid (ei,j) (2) zi,j \u223c Bernoulli(pi,j) (3)\nWhen zi,j = 1, we set ti = j and start generating a target token yi; otherwise, we set ti = j+ 1 and repeat the process. During training, an expected alignment \u03b1 is introduced to replace the softmax attention. It can be calculated in a recurrent manner, shown in Equation 4:\n\u03b1i,j = pi,j j\u2211 k=1\n( \u03b1i\u22121,k\nj\u22121\u220f l=k (1\u2212 pi,l)\n)\n= pi,j ( (1\u2212 pi,j\u22121)\n\u03b1i,j\u22121 pi,j\u22121 + \u03b1i\u22121,j ) (4) Raffel et al. (2017) also introduce a closed-form parallel solution for the recurrence relation in Equation 5:\n\u03b1i,: = pi,:cumprod(1\u2212 pi,:)cumsum (\n\u03b1i\u22121,: cumprod(1\u2212 pi,:)\n) (5)\nwhere cumprod(x) = [1, x1, x1x2, ..., \u220f|x|\u22121 i=1 xi] and cumsum(x) = [x1, x1 + x2, ..., \u2211|x| i=1 xi]. In practice, the denominator in Equation 5 is clamped into a range of [ , 1] to avoid numerical instabilities introduced by cumprod. Although this monotonic attention mechanism achieves online\n2Note that during training, to encourage discreteness, Raffel et al. (2017) added a zero mean, unit variance pre-sigmoid noise to ei,j .\nlinear time decoding, the decoder can only attend to one encoder state. This limitation can diminish translation quality as there may be insufficient information for reordering.\nMoreover, the model lacks a mechanism to adjust latency based on different requirements at decoding time. To address these issues, Chiu & Raffel (2018) introduce Monotonic Chunkwise Attention (MoChA), which allows the decoder to apply softmax attention to a fixed-length subsequence of encoder states. Alternatively, Arivazhagan et al. 
(2019) introduce Monotonic Infinite Lookback Attention (MILk) which allows the decoder to access encoder states from the beginning of the source sequence. The expected attention for the MILk model is defined in Equation 6.\n\u03b2i,j = |x|\u2211 k=j ( \u03b1i,k exp(ui,j)\u2211k l=1 exp(ui,l) ) (6) 2.2 MONOTONIC MULTIHEAD ATTENTION\nPrevious monotonic attention approaches are based on RNN encoder-decoder models with a single attention and haven\u2019t explored the power of the Transformer model. 3 The Transformer architecture (Vaswani et al., 2017) has recently become the state-of-the-art for machine translation (Barrault et al., 2019). An important feature of the Transformer is the use of a separate multihead attention module at each layer. Thus, we propose a new approach, Monotonic Multihead Attention (MMA), which combines the expressive power of multihead attention and the low latency of monotonic attention.\nMultihead attention allows each decoder layer to have multiple heads, where each head can compute a different attention distribution. Given queries Q, keys K and values V , multihead attention MultiHead(Q,K, V ) is defined in Equation 7.\nMultiHead(Q,K, V ) = Concat(head1, ..., headH)WO where headh = Attention ( QWQh ,KW K h , V W V h , ) (7)\nThe attention function is the scaled dot-product attention, defined in Equation 8: Attention(Q,K, V ) = Softmax ( QKT\u221a dk ) V (8)\nThere are three applications of multihead attention in the Transformer model:\n1. The Encoder contains self-attention layers where all of the queries, keys and values come from previous layers.\n2. The Decoder contains self-attention layers that allow each position in the decoder to attend to all positions in the decoder up to and including that position.\n3. The Encoder-Decoder attention contains multihead attention layers where queries come from the previous decoder layer and the keys and values come from the output of the encoder. 
Every decoder layer has a separate encoder-decoder attention.\nFor MMA, we assign each head to operate as a separate monotonic attention in encoder-decoder attention.\nFor a transformer with L decoder layers and H attention heads per layer, we define the selection process of the h-th head encoder-decoder attention in the l-th decoder layer as\nel,hi,j =\n( mjW K l,h(si\u22121W Q l,h) T\n\u221a dk\n) i,j\n(9)\npl,hi,j = Sigmoid(ei,j) (10)\nzl,hi,j \u223c Bernoulli(pi,j) (11) 3MILk was based on a strengthened RNN-based model called RNMT+. The original RNMT+ model (Chen et al., 2018) uses multihead attention, computes attention only once, and then concatenates that single attention layer to the output of each decoder layer block. However, the RNMT+ model used for MILk in Arivazhagan et al. (2019) only uses a single head.\nwhere Wl,h is the input projection matrix, dk is the dimension of the attention head. We make the selection process independent for each head in each layer. We then investigate two types of MMA, MMA-H(ard) and MMA-IL(infinite lookback). For MMA-H, we use Equation 4 in order to calculate the expected alignment for each layer each head, given pl,hi,j . For MMA-IL, we calculate the softmax energy for each head as follows:\nul,hi,j = SoftEnergy =\n( mjW\u0302 K l,h(si\u22121W\u0302 Q l,h) T\n\u221a dk\n) i,j\n(12)\nand then use Equation 6 to calculate the expected attention. Each attention head in MMA-H hardattends to one encoder state. On the other hand, each attention head in MMA-IL can attend to all previous encoder states. Thus, MMA-IL allows the model to leverage more information for translation, but MMA-H may be better suited for streaming systems with stricter efficiency requirements. Finally, our models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is also required for simultaneous translation.\nAt inference time, our decoding strategy is shown in Algorithm 1. 
For each l, h, at decoding step i, we apply the sampling processes discussed in subsection 2.1 individually and set the encoder step at tl,hi . Then a hard alignment or partial softmax attention from encoder states, shown in Equation 13, will be retrieved to feed into the decoder to generate the i-th token. The model will write a new target token only after all the attentions have decided to write. In other words, the heads that have decided to write must wait until the others have finished reading.\ncli = Concat(c l,1 i , c l,2 i , ..., c l,H i )\nwhere cl,hi = fcontext(h, t l,h i ) =  mtl,hi\nMMA-H\ntl,hi\u2211 j=1\nexp ( ul,hi,j ) \u2211tl,hi j=1 exp ( ul,hi,j )mj MMA-IL (13)\nFigure 1 illustrates a comparison between our model and the monotonic model with one attention head. Compared with the monotonic model, the MMA model is able to set attention to different positions so that it can still attend to previous states while reading each new token. Each head can adjust its speed on-the-fly. Some heads read new inputs, while the others can stay in the past to retain the source history information. Even with the hard alignment variant (MMA-H), the model is still able to preserve the history information by setting heads to past states. In contrast, the hard monotonic model, which only has one head, loses the previous information at the attention layer. 2.3 LATENCY CONTROL\nEffective simultaneous machine translation must balance quality and latency. At a high level, latency measures how many source tokens the model has read until a translation is generated. The model we have introduced in subsection 2.2 is not able to control latency on its own. While MMA allows simultaneous translation by having a read or write schedule for each head, the overall latency is determined by the fastest head, i.e. the head that reads the most. It is possible that a head always reads new input without producing output, which would result in the maximum possible latency. 
Note that the attention behaviors in MMA-H and MMA-IL can be different. In MMA-IL, a head reaching the end of the sentence will provide the model with maximum information about the source sentence. On the other hand, in the case of MMA-H, reaching the end of sentence for a head only\nAlgorithm 1 MMA monotonic decoding. Because each head is independent, we compute line 3 to 16 in parallel\nInput: x = source tokens,h = encoder states, i = 1, j = 1, tl,h0 = 1, y0 = StartOfSequence. 1: while yi\u22121 6= EndOfSequence do 2: tmax = 1 3: h = empty sequence 4: for l\u2190 1 to L do 5: for h\u2190 1 to H do 6: for j \u2190 tl,hi\u22121 to |x| do 7: pl,hi,j = Sigmoid ( MonotonicEnergy(si\u22121,mj )\n) 8: if pl,hi,j > 0.5 then 9: tl,hi = j\n10: cl,hi = fcontext(h, t l,h i ) 11: Break 12: else 13: if j > tmax then 14: Read token xj 15: Calculate state hj and append to h 16: tmax = j 17: cli = Concat(c l,1 i , c l,2 i , ..., c l,H i ) 18: sli = DecoderLayer l(sl1:i\u22121, s l\u22121 1:i\u22121, c l i) 19: yi = Output(sLi ) 20: i = i+ 1\ngives a hard alignment to the end-of-sentence token, which provides very little information to the decoder. Furthermore, it is possible that an MMA-H attention head stays at the beginning of sentence without moving forward. Such a head would not cause latency issues but would degrade the model quality since the decoder would not have any information about the input. In addition, this behavior is not suited for streaming systems.\nTo address these issues, we introduce two latency control methods. The first one is weighted average latency, shown in Equation 14:\ngWi = exp(gl,hi )\u2211L\nl=1 \u2211H h=1 exp(g l,h i ) gl,hi (14)\nwhere gl,hi = \u2211|x| j=1 j\u03b1i,j . Then we calculate the latency loss with a differentiable latency metric C.\nLavg = C ( gW )\n(15)\nLike Arivazhagan et al. (2019), we use the Differentiable Average Lagging. It is important to note that, unlike the original latency augmented training in Arivazhagan et al. 
(2019), Equation 15 is not the expected latency metric given C, but a weighted average of C over all the attention heads. The real expected latency is g\u0302 = maxl,h ( gl,h ) instead of g\u0304, but using this directly would only affect the speed of the fastest head. Equation 15 can control every head in such a way that the faster heads are automatically assigned larger weights and the slower heads are also moderately regularized. For MMA-H models, we found that the latency is mainly due to outliers that skip almost every token. The weighted average latency loss is not sufficient to control the outliers. We therefore introduce the head divergence loss, the average variance of expected delays at each step, defined in Equation 16:\nLvar = 1/(LH) \u2211_{l=1}^{L} \u2211_{h=1}^{H} ( g_i^{l,h} \u2212 g\u0304_i )^2 (16)\nwhere g\u0304_i = 1/(LH) \u2211_{l,h} g_i^{l,h}. The final objective function is presented in Equation 17: L(\u03b8) = \u2212 log(y | x; \u03b8) + \u03bbavgLavg + \u03bbvarLvar (17) where \u03bbavg , \u03bbvar are hyperparameters that control both losses. Intuitively, while \u03bbavg controls the overall speed, \u03bbvar controls the divergence of the heads. Combining these two losses, we are able to dynamically control the range of attention heads so that we can control the latency and the reading buffer. For the MMA-IL model, we only use Lavg; for MMA-H we only use Lvar. 3 EXPERIMENTAL SETUP\n 3.1 EVALUATION METRICS\nWe evaluate our model using quality and latency. For translation quality, we use tokenized BLEU 4 for IWSLT15 En-Vi and detokenized BLEU with SacreBLEU (Post, 2018) for WMT15 De-En. For latency, we use three different recent metrics, Average Proportion (AP) (Cho & Esipova, 2016), Average Lagging (AL) (Ma et al., 2019) and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019) 5. We remind the reader of the metric definitions in Appendix A.2. 3.2 DATASETS\nWe evaluate our method on two standard machine translation datasets, IWSLT15 En-Vi and WMT15 De-En. 
Statistics of the datasets can be found in Table 1. For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing.\nIWSLT15 English-Vietnamese TED talks from IWSLT 2015 Evaluation Campaign (Cettolo et al., 2016). We follow the settings from Luong & Manning (2015) and Raffel et al. (2017). We replace words with frequency less than 5 by <unk>. We use tst2012 as a validation set and tst2013 as a test set. WMT15 German-English We follow the setting from Arivazhagan et al. (2019). We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols. We use newstest2013 as a validation set and newstest2015 as a test set. 3.3 MODELS\nWe evaluate MMA-H and MMA-IL models on both datasets. The MILk model we evaluate on IWSLT15 En-Vi is based on Luong et al. (2015) rather than RNMT+ (Chen et al., 2018). In general, our offline models use unidirectional encoders, i.e. the encoder self-attention can only attend to previous states, and greedy decoding. We report offline model performance in Table 2 and the effect of using unidirectional encoders and greedy decoding in Table 3. For MMA models, we replace the encoder-decoder layers with MMA and keep other hyperparameter settings the same as the offline model. Detailed hyperparameter settings can be found in subsection A.1. We use the Fairseq library (Ott et al., 2019) for our implementation.\n4We acquire the data from https://nlp.stanford.edu/projects/nmt/, which is tokenized. We do not have the tokenizer which processed this data, thus we report tokenized BLEU for IWSLT15.\n5Latency metrics are computed on BPE tokens for WMT15 De-En \u2013 consistent with Arivazhagan et al. (2019) \u2013 and on word tokens for IWSLT15 En-Vi.\n6 Luong & Manning (2015) report a BLEU score of 23.0 but they didn\u2019t mention what type of BLEU score they used. 
This score is from our implementation on the data acquired from https://nlp.stanford.edu/projects/nmt/ 4 RESULTS\nIn this section, we present the main results of our model in terms of latency-quality tradeoffs, ablation studies and analyses. In the first study, we analyze the effect of the variance loss on the attention span. Then, we study the effect of the number of decoder layers and decoder heads on quality and latency. We also provide a case study for the behavior of attention heads in an example. Finally, we study the relationship between the rank of an attention head and the layer it belongs to. 4.1 LATENCY-QUALITY TRADEOFFS\nWe plot the quality-latency curves for MMA-H and MMA-IL in Figure 2. The BLEU and latency scores on the test sets are generated by setting a latency range and selecting the checkpoint with the best BLEU score on the validation set. We use differentiable average lagging (Arivazhagan et al., 2019) when setting the latency range. We find that for a given latency, our models obtain a better translation quality. While MMA-IL tends to have a decrease in quality as the latency decreases, MMA-H has a small gain in quality as latency decreases: a larger latency does not necessarily mean an increase in source information available to the model. In fact, the large latency is from the outlier attention heads, which skip the entire source sentence and point to the end of the sentence. The outliers not only increase the latency but also fail to provide useful information. We introduce the attention variance loss to eliminate the outliers, as such a loss makes the attention heads focus on the current context for translating the new target token.\nIt is interesting to observe that MMA-H has a better latency-quality tradeoff than MILk7 even though each head attends to only one state. 
Although MMA-H is not yet able to handle an arbitrarily long input (without resorting to segmenting the input), since both encoder and decoder self-attention have an infinite lookback, that model represents a good step in that direction. 4.2 ATTENTION SPAN\nIn subsection 2.3, we introduced the attention variance loss to MMA-H in order to prevent outlier attention heads from increasing the latency or increasing the attention span. We have already evaluated the effectiveness of this method on latency in subsection 4.1. We also want to measure the difference between the fastest and slowest heads at each decoding step. We define the average\n7The numbers of MILk on WMT15 De-En are from Arivazhagan et al. (2019)\nattention span in Equation 18:\nS\u0304 = 1\n|y|  |y|\u2211 i max l,h tl,hi \u2212min l,h tl,hi  (18) It estimates the reading buffer we need for streaming translation. We show the relation between the average attention span versus \u03bbvar in Figure 3. As expected, the average attention span is reduced as we increase \u03bbvar. 4.3 EFFECT ON NUMBER OF LAYERS AND NUMBER OF HEADS\nOne motivation to introduce MMA is to adapt the Transformer, which is the current state-of-the-art model for machine translation, to online decoding. Important features of the Transformer architecture include having a separate attention layer for each decoder layer block and multihead attention. In this section, we test the effect of these two components on the offline, MMA-H, and MMA-IL models from a quality and latency perspective. We report quality as measured by detokenized BLEU and latency as measured by DAL on the WMT13 validation set in Figure 4. We set \u03bbavg = 0.2 for MMA-IL and \u03bbvar = 0.2 for MMA-H.\nThe offline model benefits from having more than one decoder layer. In the case of 1 decoder layer, increasing the number of attention heads is beneficial but in the case of 3 and 6 decoder layers, we do not see much benefit from using more than 2 heads. 
The best performance is obtained for 3 layers and 2 heads (6 effective heads). The MMA-IL model behaves similarly to the offline model, and the best performance is observed with 6 layers and 4 heads (24 effective heads). For MMA-H, with 1 layer, performance improves with more heads. With 3 layers, the single-head setting is the most effective (3 effective heads). Finally, with 6 layers, the best performance is reached with 16 heads (96 effective heads).\nThe general trend we observe is that performance improves as we increase the number of effective heads, either from multiple layers or multihead attention, up to a certain point, then either plateaus or degrades. This motivates the introduction of the MMA model.\nWe also note that latency increases with the number of effective attention heads. This is due to having fixed loss weights: when more heads are involved, we should increase \u03bbvar or \u03bbavg to better control latency. 4.4 ATTENTION BEHAVIORS\nWe characterize attention behaviors by providing a running example of MMA-H and MMA-IL, shown in Figure 5. Each curve represents the path that an attention head goes through at inference time. For MMA-H, shown in Figure 5a, we found that when the source and target tokens have the same order, the attention heads behave linearly and the distance between fastest head and slowest head is small. For example, this can be observed from partial sentence pair \u201cI also didn\u2019t know that\u201d and target tokens \u201cT\u00f4i c\u0169ng kh\u00f4ng bi\u1ebft r\u1eb1ng\u201d, which have the same order. However, when the source tokens and target tokens have different orders, such as \u201cthe second step\u201d and \u201cb\u01b0\u1edbc (step) th\u1ee9 hai (second)\u201d, the model will generate \u201cb\u01b0\u1edbc (step)\u201d first and some heads will stay in the past to retain\nthe information for later reordered translation \u201cth\u1ee9 hai (second)\u201d. 
We can also see that the attention heads have a near-diagonal trajectory, which is appropriate for streaming inputs.\nThe behavior of the heads in MMA-IL models is shown in Figure 5b. Notice that we remove the partial softmax alignment in this figure. We don\u2019t expect streaming capability for MMA-IL: some heads stop at early positions of the source sentence to retain the history information. Moreover, because MMA-IL has more information when generating a new target token, it tends to produce translations with better quality. In this example, the MMA-IL model has a better translation on \u201cisolate the victim\u201d than MMA-H (\u201cl\u00e0 c\u00f4 l\u1eadp n\u1ea1n nh\u00e2n\u201d vs \u201cl\u00e0 t\u00e1ch bi\u1ec7t n\u1ea1n nh\u00e2n\u201d). 4.5 RANK OF THE HEADS\nIn Figure 6, we calculate the average and standard deviation of the rank of each head when generating every target token. For MMA-IL, we find that heads in lower layers tend to have higher rank and are thus slower. However, in MMA-H, the differences in average rank are smaller. Furthermore, the standard deviation is very large, which means that the order of the heads in MMA-H changes frequently over the inference process. 6 CONCLUSION\nIn this paper, we propose two variants of the monotonic multihead attention model for simultaneous machine translation. By introducing two new targeted loss terms which allow us to control both latency and attention span, we are able to leverage the power of the Transformer architecture to achieve better quality-latency trade-offs than the previous state-of-the-art model. We also present detailed ablation studies demonstrating the efficacy and rationale of our approach. By introducing these stronger simultaneous sequence-to-sequence models, we hope to facilitate important applications, such as high-quality real-time interpretation between human speakers. 
A APPENDIX\nA.1 HYPERPARAMETERS\nThe hyperparameters we used for offline and monotonic transformer models are defined in Table 4.\nA.2 LATENCY METRICS DEFINITIONS\nGiven the delays g = {g1, g2, ..., g|y|} of generating each target token, AP, AL and DAL are defined in Table 5.\nA.3 DETAILED RESULTS\nWe provide the detailed results behind Figure 2 in Table 6 and Table 7.\nA.4 THRESHOLD OF READING ACTION\nWe explore a simple method that can adjust the system\u2019s latency at inference time without training new models. In Algorithm 1 line 8, 0.5 was used as the threshold. One can set a different threshold p at inference time to control the latency. We run pilot experiments on the IWSLT15 En-Vi dataset and the results are shown in Table 8. Although this method doesn\u2019t require training a new model, it dramatically hurts the translation quality.\nA.5 AVERAGE LOSS FOR MMA-H\nWe explore applying a simple average instead of a weighted average loss to MMA-H. The results are shown in Figure 7 and Table 9. We find that even with very large weights, we are unable to reduce the overall latency. In addition, we find that the weighted average loss severely degrades the translation quality. On the other hand, the divergence loss we propose in Equation 16 can efficiently reduce the latency while retaining relatively good translation quality for MMA-H models.\nWMT15 German-English", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Sorting out lipschitz function approximation , Authors: Cem Anil James Lucas... , 2018\n2.  Title: A unified view of piecewise linear neural network verification , Authors: Rudy Bunel Ilker Turkaslan... , 2018\n3.  Title: Maximum resilience of artificial neural networks , Authors: Chih-Hong Cheng Georg N\u00fchrenberg... , 2017\n4.  Title: Parseval networks: Improving robustness to adversarial examples , Authors: Moustapha Cisse Piotr Bojanowski... , 2017\n5.  Title: Output range analysis for deep neural networks , Authors: Souradeep Dutta Susmit Jha... , 2017\n6.  Title: Regularisation of neural networks by enforcing lipschitz continuity , Authors: Henry Gouk Eibe Frank... , 2018\n7.  Title: On the effectiveness of interval bound propagation for training verifiably robust models , Authors: Sven Gowal Krishnamurthy Dvijotham... , 2018\n8.  Title: Differentiable abstract interpretation for provably robust neural networks , Authors: Matthew Mirman Timon Gehr... , 2018\n9.  Title: Fast and effective robustness certification , Authors: Gagandeep Singh Timon Gehr... , 2018\n10.  Title: Evaluating robustness of neural networks with mixed integer programming , Authors: Vincent Tjeng Kai Y Xiao... , 2017\n11.  Title: Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks , Authors: Yusuke Tsuzuku Issei Sato... , 2018\n12.  Title: Enresnet: Resnet ensemble via the feynman-kac , Authors: Bao Wang Binjie Yuan... , 2018\n13.  Title: Towards fast computation of certified robustness for relu networks , Authors: Tsuiwei Weng Huan Zhang... , 2018\n14.  Title: Efficient neural network robustness certification with general activation functions , Authors: Huan Zhang Tsui-Wei Weng... 
, 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Robustness is an important property to guarantee the security of machine learning models. It has recently been demonstrated that strong robustness certificates can be obtained on ensemble classifiers generated by input randomization. However, tight robustness certificates are only known for symmetric norms including `0 and `2, while for asymmetric norms like `1, the existing techniques do not apply. By converting the likelihood ratio into a one dimensional mixed random variable, we derive the first tight `1 robustness certificate under isotropic Laplace distributions in binary case. Empirically, the deep networks smoothed by Laplace distributions yield the state-of-the-art certified robustness in `1 norm on CIFAR-10 and ImageNet.", "Robustness of a model can be defined in various aspects. For example, Feynman-Kac Formalism can be used to improve robustness (Wang et al., 2018). In this paper, we focus on the classification setting, where the goal is to provide guarantee of a constant prediction among a small region specified via some metric. The robustness certificate can be either exact or conservative, so long as a constant prediction is guaranteed in the certified region. Note that the certification of a completely black-box model requires checking the prediction values at every point around the point of interest, which is clearly infeasible. A practical certification algorithm inevitably has to specify and leverage the functional structure of the classifier in use to reduce the required computation. Exact certificates. The exact certificate of deep networks is typically derived for the networks with a piecewise linear activation function such as ReLU. Such networks have an equivalent mixed integer linear representation (Cheng et al., 2017; Lomuscio & Maganti, 2017; Dutta et al., 2017; Bunel et al., 2018). 
Hence, one may apply mixed integer linear programming to find the worst case adversary within any convex polyhedron such as an `1-ball or `\u221e-ball. Despite the elegant solution, the complexity is, in general, NP-hard and the algorithms are not scalable to large problems(Tjeng et al., 2017). Conservative certificates. A conservative certificate can be more scalable than the exact methods, since one can trade-off the accuracy of certification with efficiency (Gouk et al., 2018; Tsuzuku et al., 2018; Cisse et al., 2017; Anil et al., 2018; Hein & Andriushchenko, 2017). For example, one can relax the search of the worst case adversary as a simpler optimization problem that only bounds the effect of such adversary. Alternatively, people also consider the robustness problem in a modular way, where the robustness guarantee can be derived iteratively for each layer in the deep networks by considering the feasible values for each hidden layer (Gowal et al., 2018; Weng et al., 2018; Zhang et al., 2018; Mirman et al., 2018; Singh et al., 2018).", "In this paper, the author derived a tight ell_1, which is not the symmetric norm, robustness certificates under isotropic Laplace distributions. Experimentally, the authors showed that the deep networks smoothed \nby Laplace distributions yield the state-of-the-art certified robustness in ell_1 norm on the CIFAR-10 \nand ImageNet. To find the ell_1 certificate, the authors first identified the tight robustness certificate, for attacking the model in one particular direction, say the first direction. To show that any other perturbation directions cannot lead to a worse result, the authors convert the d dimensional likelihood function into a one-dimensional function, and the authors used relaxation for different perturbations and show that the worst-case result is bounded by the previously identified direction.  However, I have the following concerns about this work:\n\n1. 
Theoretically, the authors only showed the certificate is tight for binary classification. I would suggest\nthe author change their claim in the abstract.\n\n2. What is M on page 3 which is used without definition after definition 1?\n\n3. Can you give a concrete continuous probability distribution that leads to the scenario in Fig.~3?\n\n4. Can you extend the analysis to a multi-class classification scenario?\n\n5. Besides randomized smoothing on the input images, recently Wang et al showed that randomize the deep nets can\nalso improve the deep nets and they gave it a nice theoretical interpretation. Here is the reference: Bao Wang, Binjie Yuan, Zuoqiang Shi, Stanley J. Osher. ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies, arXiv:1811.10745, NeurIPS, 2019\n\nOverall, since this work is a straightforward integration of some existing work, I think this\npaper lack novelty. Please address the above questions in rebuttal.\n3: Weak Reject"], "input": "1 INTRODUCTION\nhave done a series of nice works in practical sights or theoretical sights (Zheng et al., 2016; Gouk et al., 2018). Among them, certifiably robustness is valuable, since it can withstand all attacks within a norm ball and has a nice theoretical and practical outcome. However, most work cannot deal with the case for general neural networks.\nDeep networks are flexible models that are widely adopted in various applications. However, it has been shown that such models are vulnerable against adversary (Szegedy et al., 2014). Concretely, an unnoticeable small perturbation on the input can cause a typical deep model to change predictions arbitrarily. The phenomenon raises the concerns of the security of deep models, and hinders its deployment in decision-critical applications. 
Indeed, the certification of robustness is a pre-requisite when AI-generated decisions may have important consequences.\nCertifying the robustness of a machine learning model is challenging, especially for modern deep learning models that are over-parameterized and effectively black-box. Hence, the existing approaches mainly rely on empirical demonstration against specific adversarial attack algorithms (Goodfellow et al., 2015; Madry et al., 2018; Finlay et al., 2019). However, this line of work can give a false sense of security. Indeed, successful defense against the existing attack algorithms does not guarantee actual robustness against any adversaries that may appear in the future.\nRecently, the adversarial robustness community has shifted the focus towards establishing certificates that prove the robustness of deep learning models. The certificate can be either exact or conservative, so long as the certified region cannot exhibit any adversarial examples. Given the over-parameterized deep models and modern high-dimensional datasets, scalability becomes a key property for the certification algorithms, as many methods are computationally intractable.\nOur work is based on the novel modeling scheme that generates ensembles of a fixed black-box classifier based on input randomization (Cohen et al., 2019). Under this framework, tight robustness certificates can be obtained with only the ensemble prediction values and randomization parameters. Given appropriate choices of distributions, the robustness guarantee can be derived for \u21132 or \u21130 norms (Cohen et al., 2019; Lee et al., 2019). The tightness simply implies that any point outside the certified region is an adversarial example in the worst case. However, the derivations of the previous results rely heavily on the fact that the target norm (\u21132 or \u21130) is symmetric, so analyzing any perturbation direction for attacking the model gives the same certification guarantee.\nIn contrast, the \u21131 norm is asymmetric. 
That is, for a given \u21131 ball centered at the origin, if we shift a second \u21131 ball away from the origin by \u03b4, where \u2016\u03b4\u20161 is fixed, then the overlapped region\nbetween the two \u21131 balls may have different shapes and sizes depending on the direction of \u03b4 (see Figure 1). The characterization of this overlapped region is the key step for proving tight certificates, hence the existing techniques do not apply to the \u21131 norm.\nIn this work, we derive a tight \u21131 robustness guarantee under isotropic Laplace distributions. The Laplace distribution can be interpreted as an infinite mixture of uniform distributions over \u21131-norm balls, which is a natural \u201cconjugate\u201d distribution for the \u21131 norm. Due to asymmetry, we first identify the tight robustness certificate for attacking the model in one particular direction, \u03b4 = (\u2016\u03b4\u20161, 0, \u00b7 \u00b7 \u00b7 , 0). To show that other perturbation directions cannot lead to worse results, we convert the d-dimensional likelihood function into a one-dimensional function, where we apply relaxation for various \u03b4 and show that the worst case result is bounded by the specific direction (\u2016\u03b4\u20161, 0, \u00b7 \u00b7 \u00b7 , 0). Theoretically, our certificate is tight in the binary classification setting. In the multi-class classification setting, our certificate is always tighter than the previous certificate proposed by Lecuyer et al. (2019). The theoretical improvement always leads to superior empirical results on certifying the same model, where we demonstrate the result on CIFAR-10 and ImageNet with ResNet models. Moreover, the proposed robustness certificate on models smoothed by Laplace distributions also outperforms the same models trained and certified using Gaussian distributions (Cohen et al., 2019) in \u21131 certified robustness, where the Gaussian-based robustness certificate is adapted from the \u21132 norm. 
3 PRELIMINARIES\nDefinition 1 (Laplace distribution) Given \u03bb \u2208 R+, d \u2208 Z+, we use L(\u03bb) to denote the Laplace distribution in dimension d with parameter \u03bb. The p.d.f. of L(\u03bb) is denoted as L(x;\u03bb) ,\n1 (2\u03bb)d exp(\u2212\u2016x\u20161\u03bb ).\nAs we will see in Lemma 3.1, in smoothing analysis, we are interested in the likelihood ratio of two random variables X = and Y = \u03b4 + (here \u223c L(\u03bb) and \u03b4 \u2208 Rd is a fixed vector). Specifically,\n\u00b5Y (x) \u00b5X(x) = exp ( \u2212 1 \u03bb (\u2016x\u2212 \u03b4\u20161 \u2212 \u2016x\u20161) )\nTherefore, the likelihood ratio between two d dimensional random variables is controlled by a one dimensional random variable T (x) , \u2016x \u2212 \u03b4\u20161 \u2212 \u2016x\u20161, where x \u223c L(\u03bb). This transformation is crucial in our analysis, and it is easy to see that T (x) is a mixed random variable, since Px(T (x) = \u2016\u03b4\u20161) > 0. In our analysis, we need to calculate the inverse of c.d.f. of T (x). However, since T (x) is a mixed random variable, sometimes the inverse may not exist. See Figure 3 for illustration, where the inverse of the probability 0.85 does not exist. To deal with this case, we have the following modified version of Neyman-Pearson Lemma, with the proof in Appendix A.\nLemma 3.1 (Neyman-Pearson Lemma for mixed random variables). Let X \u223c L (\u03bb) and Y \u223c L (\u03bb) + \u03b4. Let h : Rd \u2192 {0, 1} be any deterministic or random function. Given any \u03b2 \u2208 R, and S\u2032 \u2286 { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 = \u03b2 } :\n1. If S = { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 > \u03b2 } \u222a S\u2032, and P(h(X) = 1) \u2265 P(X \u2208 S) then P(h(Y ) = 1) \u2265 P(Y \u2208 S)\n2. 
If S = { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 < \u03b2 } \u222aS\u2032, and P(h(X) = 1) \u2264 P(X \u2208 S), then P(h(Y ) = 1) \u2264 P(Y \u2208 S) 4 MAIN RESULTS\nIn this paper, we apply the randomized smoothing technique (Cohen et al., 2019) for getting robustness certificates, which works as follows. Given an input x, we perturb it with , s.t. \u223c L(\u03bb). Then instead of evaluating the robustness of the original function f(x), we evaluate g(x) , arg maxc P (f(x+ ) = c), which is effectively the smoothed version of f(x). 4.1 ROBUSTNESS CERTIFICATES FOR GENERAL CASES\nOur first theorem proves that for the smoothed classifier g, and a given input x, there always exists a robust radius R, such that any perturbation \u03b4 s.t. \u2016\u03b4\u20161 \u2264 R, does not alter the prediction of g(x).\nTheorem 1 Let f : Rd \u2192 Y be deterministic or random function, and let \u223c L(\u03bb). Let g(x) = arg maxc P (f(x+ ) = c). Suppose PA, PB \u2208 [0, 1] are such that\nP (f(x+ ) = cA) \u2265 PA \u2265 PB \u2265 max c 6=cA P(f(x+ ) = c)\nThen g(x+ \u03b4) = g(x),\u2200\u2016\u03b4\u20161 \u2264 R, where\nR = max\n{ \u03bb\n2 log(PA/PB),\u2212\u03bb log(1\u2212 PA + PB)\n} (1)\nSome Remarks:\n1. When PA \u2192 1 or PB \u2192 0, we can get R \u2192 \u221e. It is reasonable since the Laplace distribution is supported over Rd, PA \u2192 1 is equivalent to f = cA almost everywhere.\n2. Compared with (Lecuyer et al., 2019) where they have R = \u03bb2 log(PA/PB), our bound\nis better if 1\u22122PA(1\u2212PA)\u2212\n\u221a 1\u22124PA(1\u2212PA)\n2PA \u2264 PB \u2264\n1\u22122PA(1\u2212PA)+ \u221a\n1\u22124PA(1\u2212PA) 2PA . 
See Figure 4 for an illustration, where we use baseline to denote the bound R = (\u03bb/2) log(PA/PB).\nProof sketch: (The full proof is in Appendix B) For an arbitrary classifier f, we can transform it into a smoothed classifier g using the random smoothing technique, where g returns class cA with probability no less than PA, and class cB with probability no more than PB. Below we list the three main ideas used in our proof:\n1. How to deal with an arbitrary f with PA and PB?\nFollowing Cohen et al. (2019), we use the Neyman-Pearson Lemma to transform the relation between P(f(X) = cA) and P(f(Y) = cA) into the relation between P(X \u2208 A) and P(Y \u2208 A). By Lemma 3.1, the Neyman-Pearson Lemma still holds for mixed random variables. 2. How to deal with the relation between X = \u03b5 and Y = \u03b5 + \u03b4?\nInspired by Lecuyer et al. (2019), we use the DP-form inequality P(Y \u2208 A) \u2264 exp(\u2016\u03b4\u20161/\u03bb) P(X \u2208 A), which holds for the Laplace distribution, to relate P(X \u2208 A) and P(Y \u2208 A). 3. Take complements to get a tighter bound.\nWhen PA or PB is less than 1/2, the above DP-form inequality gets tighter. Therefore, we analyze A^c when PA \u2265 1/2 to get a new bound, and compare it with the baseline expression. We derive this bound by the Neyman-Pearson Lemma in this work, but an alternative approach is to use the Re\u0301nyi Divergence (Li et al., 2018). 4.2 TIGHT ROBUSTNESS CERTIFICATES FOR BINARY CASE\nAlthough we get an improved result over Lecuyer et al. (2019), the bound in Theorem 1 is not tight since it considers the general case with multiple categories. In this section, we first present our result for binary classification (Theorem 2), which further improves over Theorem 1.\nTheorem 2 (binary case) Let f : Rd \u2192 Y be a deterministic or random function, and let \u03b5 \u223c L(\u03bb). Let g(x) = arg max_c P(f(x + \u03b5) = c). Suppose there are only two classes cA and cB, and PA \u2208 [1/2, 1] s.t. 
P(f(x + \u03b5) = cA) \u2265 PA. Then g(x + \u03b4) = g(x), \u2200\u2016\u03b4\u20161 \u2264 R, for\nR = \u2212\u03bb log[2(1 \u2212 PA)] (2)\nSketch of the proof: (The full proof is in Appendix C) Theorem 2 is a special binary case of Theorem 1, and a method similar to that of Theorem 1 gives the result. However, it is worth noting that in the binary case, our new improved bound in Theorem 1 always dominates the bound by Lecuyer et al. (2019). Moreover, our bound in Eqn. (2) is tight, as shown below.\nTheorem 3 (tight bound in binary case) In the same setting as Theorem 2, assume PA + PB \u2264 1 and PA \u2265 1/2. \u2200R\u2032 > \u2212\u03bb log[2(1 \u2212 PA)], \u2203 base classifier f\u2217 and perturbation \u03b4\u2217 with g\u2217(x) = arg max_c P(f\u2217(x + \u03b5) = c) and \u2016\u03b4\u20161 = R\u2032, s.t. g\u2217(x) \u2260 g\u2217(x + \u03b4\u2217).\nSketch of the proof: (The full proof is in Appendix C) For Theorem 3, we prove that the bound in Theorem 2 is tight by calculating the results in the one-dimensional case, where \u03b4 = (\u2016\u03b4\u20161, 0, . . . , 0). By direct calculation, we show that when \u03b4 = (\u2016\u03b4\u20161, 0, . . . , 0),\nP(Y \u2208 B) = \u222b_{\u2212\u221e}^{\u2016\u03b4\u20161 + \u03bb log[2PB]} (1/(2\u03bb)) exp(\u2212|x|/\u03bb) dx\n= exp(\u2016\u03b4\u20161/\u03bb) PB when \u2016\u03b4\u20161 \u2264 \u2212\u03bb log[2PB], and 1 \u2212 (1/(4PB)) exp(\u2212\u2016\u03b4\u20161/\u03bb) otherwise.\nTherefore, when \u2016\u03b4\u20161 \u2264 \u2212\u03bb log[2PB], the DP-inequality is tight. The worst-case \u03b4 appears in the one-dimensional case.\nFigure 5 shows the reason why the inequality is tight. When \u03b4 is small, for P(X \u2208 B), the set B we selected satisfies \u2200x \u2208 B, T(x) = \u2212\u2016\u03b4\u20161 (red part). When P(Y \u2208 B) is considered, it moves set S towards the left by step \u03b4. However, due to the small \u03b4, S after moving still satisfies the requirement of \u2200x \u2208 S, T(x) = \u2212\u2016\u03b4\u20161 (blue part). 
Therefore, the inequality is tight. 4.3 METHOD COMPARISON\nWe compare our method with those of Cohen et al. and Lecuyer et al. in the binary case; see Table 1. We plot the curves in Figure 6. As we can see, under the same noise variance, our method reaches a larger robustness radius. Below we show simple derivations of the bounds in Table 1.\nRobustness radius of Lecuyer et al. (2019)\nUsing the basic inequality from differential privacy, we have:\nP(f(X) = cA) \u2264 exp(\u03b2) P(f(Y) = cA)\nP(f(Y) = cB) \u2264 exp(\u03b2) P(f(X) = cB)\nwhere \u03b2 = \u2016\u03b4\u20161/\u03bb. The above two inequalities show that to guarantee P(f(Y) = cA) > P(f(Y) = cB), it suffices to show that:\nP(f(X) = cA) > exp(2\u03b2) P(f(X) = cB)\nPlugging in \u03b2 = \u2016\u03b4\u20161/\u03bb, we have \u2016\u03b4\u20161 \u2264 (\u03bb/2) log(PA/PB). Furthermore, in the binary case, we can plug in PB = 1 \u2212 PA and get the robustness radius R = (\u03bb/2) log(PA/(1 \u2212 PA)). Robustness radius of Cohen et al. (2019)\nDenote Bp,r(c) = {x : \u2016x \u2212 c\u2016p \u2264 r}. Since B1,r(c) \u2282 B2,r(c), the radius in (Cohen et al., 2019) can be directly used in \u21131 form, which is \u03c3\u03a6^{\u22121}(PA).\nBesides, B1,r\u2032(c) \u2284 B2,r(c) for any r\u2032 > r, and (Cohen et al., 2019) is an exact robustness guarantee, so the best \u21131 radius that isotropic Gaussian noise random smoothing can get is \u03c3\u03a6^{\u22121}(PA).\nFinally we will prove that \u2212\u03bb log[2(1 \u2212 PA)] \u2265 (\u03bb/2) log(PA/(1 \u2212 PA)). For simplicity of notation, we set PA = p \u2265 0.5. So it is sufficient to show that \u2212\u03bb log[2(1 \u2212 p)] \u2265 (\u03bb/2) log(p/(1 \u2212 p)). Dividing by \u03bb and exponentiating, it suffices to show that 1/(2(1 \u2212 p)) \u2265 \u221a(p/(1 \u2212 p)), which is simply p(1 \u2212 p) \u2264 1/4. 5 EXPERIMENTS\n 5.1 IMPLEMENTATION DETAILS\nMonte Carlo. 
Since we cannot get the exact value of PA, we use the Monte Carlo method to estimate it. More specifically, we take multiple random samples from the Laplace distribution to estimate PA. One way to do this is to group the samples and estimate PA using non-parametric estimation.\nIn our experiments, we applied two different types of training, as described below.\nType1-Training: The first method is intuitive, and was applied in (Cohen et al., 2019). In the training process, we add noise to the inputs:\ninputs = inputs + noise\nwhere the noise is sampled from an isotropic Laplace distribution.\nType2-Training: The second method was recently proposed by Salman et al. (2019). The idea is to use adversarial noise samples instead of the raw noise samples in a neighborhood to train the base classifier. Each training sample can be decomposed as\ninputs = inputs + noise + perturbation\nwhere the noise comes from an isotropic Laplace distribution, and the perturbation is found approximately by the gradient of the loss with respect to the input. Concretely, if we denote the loss as L and the input as x, the perturbation \u2206 can be calculated by \u2206 = a \u2217 sign(\u2207xL(\u03b8, x, y)), where a is a constant.\nEvaluation Index. In this paper, we choose certified accuracy as our evaluation index. Robustness certified accuracy at radius r refers to the proportion of correctly classified samples with robustness radius at least r. Specifically, if a group of samples with capacity n is {xi}, i = 1, 2, . . . , n, the corresponding certified robustness radii are Ri. An indicator xi represents whether the sample is classified correctly. If the sample is correctly classified, xi = 1, otherwise xi = 0. 
For a given r, the corresponding robustness certified accuracy is defined as \u03b1 = \u2211_{i=1}^{n} xi 1(Ri \u2265 r)/n, where 1(\u00b7) is an indicator function.\nHowever, from Section 5.1 we know that we cannot calculate the exact robustness radius R, so we use its estimate R\u0302 to approximate R, which leads to an \u201capproximate robustness certified accuracy\u201d (\u03b1\u0302), which is calculated by\n\u03b1\u0302 = \u2211_{i=1}^{n} xi 1(R\u0302i \u2265 r)/n (3)\nCohen et al. (2019) demonstrate that when the significance level of R\u0302 is small, the difference between these two quantities is negligible. In practice, we plot approximate certified accuracy \u03b1\u0302 as a function of radius r. From Eqn. (3), we know that \u03b1\u0302 is non-increasing w.r.t. r, and when r \u2192 \u221e, \u03b1\u0302 \u2192 0. Hyperparameters. In our paper, we set all our hyperparameters following Cohen et al. (2019). Specifically, we set the significance level to 0.001, n0 = 100 in Monte Carlo simulation (used to get bound for \u03b1\u0302), and n = 100,000 in the estimation part (used to estimate \u03b1\u0302). Moreover, we test three parameters on the CIFAR-10 and ImageNet datasets (\u03c3 = 0.25, 0.50, 1.00). Since (Cohen et al., 2019) use Gaussian noise and we use Laplace noise, they should have the same standard deviation during comparison. This requires that \u03c3 = \u221a2 \u03bb. 5.2 EXPERIMENTAL RESULTS\nResults on ImageNet and CIFAR-10. We applied random smoothing on CIFAR-10 (Krizhevsky (2009)) and ImageNet (Deng et al. (2009)) respectively. On each dataset, we trained several random smoothing models with different standard deviations \u03c3 for the Laplace noise. In order to keep in line with Cohen et al.\u2019s method and make a comparison, we select \u03c3 = 0.25, 0.50, 1.00 on CIFAR-10 and ImageNet, with corresponding parameter \u03bb = \u03c3/\u221a2.\nFigure 6 draws the certified accuracy achieved by smoothing with each sigma. 
For the ImageNet dataset, we only use the most basic training method (Type1 Training). For the CIFAR-10 dataset, we use two training methods (Type1 and Type2 Training). We can see that a smaller sigma performs better when the radius is smaller. As the noise gets larger, the accuracy becomes lower, but the robustness guarantee becomes higher. The dashed black line shows the empirical robust accuracy of an undefended classifier from Cohen et al. (2019).\nComparison with baseline.\nWe will show our comparison results in the following. Based on Table 1, we will test our method on CIFAR-10 with the ResNet110 architecture and Type1 and Type2 training, and on ImageNet with the ResNet50 architecture and Type1 training. We will compare our results with (Cohen et al., 2019) and (Lecuyer et al., 2019) under the same standard deviation \u03c3. For base classifiers, ours and Lecuyer et al.\u2019s share the same base classifier with Laplace training noise, and Cohen et al.\u2019s uses the base classifier with Gaussian training noise. 6 CONCLUSION\nIn this paper, we combine the inequality from differential privacy and the classic Neyman-Pearson Lemma to resolve the challenging asymmetry of the \u21131 metric and the mixed discrete-continuous property of the likelihood ratios under isotropic Laplace distributions. In addition, by comparing the high-dimensional case with a special edge case, we prove the tight \u21131 robustness guarantee for binary classification problems, and obtain the state-of-the-art certified accuracy in large scale experiments.\nThe establishment of the \u21131 certificate via Laplace distributions and the prior result of the \u21132 certificate via Gaussian distributions may be extended to a generic theorem for a general \u2113p norm robustness certificate via the associated realization of the generalized Gaussian distribution, where the aforementioned results are special cases of the general scheme. 
The introduction of the mixed random variable analysis and \u2113p geometry analysis may serve as a valuable extension of existing works towards such a general goal. A PROOF OF LEMMA 1\nIn this section, we will prove that the Neyman-Pearson Lemma holds with mixed random variables.\nWLOG, x = 0, X \u223c L(\u03bb) and Y \u223c L(\u03bb) + \u03b4. We will first restate the Neyman-Pearson Lemma, which plays an important role in our proof.\nLemma 3.1 (restated). Let X \u223c L(\u03bb) and Y \u223c L(\u03bb) + \u03b4. Let h : Rd \u2192 {0, 1} be any deterministic or random function. Given any \u03b2 \u2208 R, and S\u2032 \u2286 { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 = \u03b2 } :\n1. If S = { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 > \u03b2 } \u222a S\u2032, and P(h(X) = 1) \u2265 P(X \u2208 S), then P(h(Y) = 1) \u2265 P(Y \u2208 S)\n2. If S = { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 < \u03b2 } \u222a S\u2032, and P(h(X) = 1) \u2264 P(X \u2208 S), then P(h(Y) = 1) \u2264 P(Y \u2208 S) Proof of Lemma 3.1 First, notice that P(X \u2208 S) can be regarded as a mixed random variable. We want to prove that as long as we can choose an S\u2032 that satisfies P(X \u2208 S) \u2264 P(h(X) = 1), the Neyman-Pearson Lemma always holds.\nLet\u2019s first see what happens in the proof of the Neyman-Pearson Lemma. Notice that X and Y are continuous variables, but X \u2208 S and Y \u2208 S can be regarded as mixed continuous-discrete events. Then we can choose a reasonable S\u2032 for X and Y. 
We will prove case 1; the other case can be proved with a similar method.\nP(h(Y) = 1) \u2212 P(Y \u2208 S)\n= \u222b_{Rd} h(1|z)\u00b5Y(z)dz \u2212 \u222b_S \u00b5Y(z)dz\n= [\u222b_{S^c} h(1|z)\u00b5Y(z)dz + \u222b_S h(1|z)\u00b5Y(z)dz] \u2212 [\u222b_S h(1|z)\u00b5Y(z)dz + \u222b_S h(0|z)\u00b5Y(z)dz]\n= \u222b_{S^c} h(1|z)\u00b5Y(z)dz \u2212 \u222b_S h(0|z)\u00b5Y(z)dz\n\u2265 t(\u222b_{S^c} h(1|z)\u00b5X(z)dz \u2212 \u222b_S h(0|z)\u00b5X(z)dz)\n= t([\u222b_{S^c} h(1|z)\u00b5X(z)dz + \u222b_S h(1|z)\u00b5X(z)dz] \u2212 [\u222b_S h(0|z)\u00b5X(z)dz + \u222b_S h(1|z)\u00b5X(z)dz])\n= t(P(h(X) = 1) \u2212 P(X \u2208 S))\n\u2265 0 (4)\nThe first inequality holds due to the construction of the mixed definition of S: if z \u2208 S, \u00b5Y(z)/\u00b5X(z) \u2264 t, and if z \u2208 S^c, \u00b5Y(z)/\u00b5X(z) \u2265 t. Compared with the continuous case, the only difference appears in the equal sign.\nIt should be noted that P(X \u2208 S) and P(Y \u2208 S) should be consistent, which means that they should use the same S\u2032. In this derivation, we can see that P(X \u2208 S) and P(Y \u2208 S) use the same set S\u2032 in Eqn. 
(4).\nNext, we will plug in the condition that X and Y are isotropic Laplace random variables.\nThen we just need to prove that { z \u2208 Rd : \u00b5Y(z)/\u00b5X(z) \u2264 t } \u21d0\u21d2 { z \u2208 Rd : \u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 \u2265 \u03b2 }. When X and Y are isotropic Laplace random variables, the likelihood ratio turns out to be:\n\u00b5Y(z)/\u00b5X(z) = exp(\u2212\u2016z \u2212 \u03b4\u20161/\u03bb) / exp(\u2212\u2016z\u20161/\u03bb)\n= exp(\u2212(1/\u03bb)(\u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161))\nBy choosing \u03b2 = \u2212\u03bb log(t), we can derive that\n\u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 \u2265 \u03b2 \u21d0\u21d2 \u00b5Y(z)/\u00b5X(z) \u2264 t\n\u2016z \u2212 \u03b4\u20161 \u2212 \u2016z\u20161 \u2264 \u03b2 \u21d0\u21d2 \u00b5Y(z)/\u00b5X(z) \u2265 t B PROOF OF THEOREM 1\nTheorem 1 (restated) Let f : Rd \u2192 Y be a deterministic or random function, and let \u03b5 \u223c L(\u03bb). Let g(x) = arg max_c P(f(x + \u03b5) = c). Suppose PA, PB \u2208 [0, 1] are such that\nP(f(x + \u03b5) = cA) \u2265 PA \u2265 PB \u2265 max_{c \u2260 cA} P(f(x + \u03b5) = c)\nThen g(x + \u03b4) = g(x), \u2200\u2016\u03b4\u20161 \u2264 R, where\nR = max{ (\u03bb/2) log(PA/PB), \u2212\u03bb log(1 \u2212 PA + PB) } (5)\nProof of Theorem 1 Denote T(x) = \u2016x \u2212 \u03b4\u20161 \u2212 \u2016x\u20161. 
By the triangle inequality, we can derive a bound for T(x):\n\u2212\u2016\u03b4\u20161 \u2264 T(x) \u2264 \u2016\u03b4\u20161 (6)\nPick \u03b21, \u03b22 such that there exist A\u2032 \u2286 {z : T(z) = \u03b21}, B\u2032 \u2286 {z : T(z) = \u03b22}, and\nP(X \u2208 {z : T(z) > \u03b21} \u222a A\u2032) = PA \u2264 P(f(X) = cA)\nP(X \u2208 {z : T(z) < \u03b22} \u222a B\u2032) = PB \u2265 P(f(X) = cB)\nDefine\nA := {z : T(z) > \u03b21} \u222a A\u2032\nB := {z : T(z) < \u03b22} \u222a B\u2032\nThus, applying Lemma 3.1, we have\nP(Y \u2208 A) \u2264 P(f(Y) = cA)\nP(Y \u2208 B) \u2265 P(f(Y) = cB) (7)\nThen consider P(Y \u2208 A) and P(Y \u2208 B):\nP(Y \u2208 A) = \u222b_A [2\u03bb]^{\u2212d} exp(\u2212\u2016x \u2212 \u03b4\u20161/\u03bb) dx\n= \u222b_A [2\u03bb]^{\u2212d} exp(\u2212\u2016x\u20161/\u03bb) exp(\u2212T(x)/\u03bb) dx\n\u2265 exp(\u2212\u2016\u03b4\u20161/\u03bb) \u222b_A [2\u03bb]^{\u2212d} exp(\u2212\u2016x\u20161/\u03bb) dx\n= exp(\u2212\u2016\u03b4\u20161/\u03bb) PA (8)\nThe inequality is derived from Eqn. (6). Similarly, we can get\nP(Y \u2208 B) = \u222b_B [2\u03bb]^{\u2212d} exp(\u2212\u2016x \u2212 \u03b4\u20161/\u03bb) dx\n= \u222b_B [2\u03bb]^{\u2212d} exp(\u2212\u2016x\u20161/\u03bb) exp(\u2212T(x)/\u03bb) dx\n\u2264 exp(\u2016\u03b4\u20161/\u03bb) \u222b_B [2\u03bb]^{\u2212d} exp(\u2212\u2016x\u20161/\u03bb) dx\n= exp(\u2016\u03b4\u20161/\u03bb) PB (9)\nFirst, we would like to show that robustness can be guaranteed when R \u2264 (\u03bb/2) log(PA/PB).\nIf \u2016\u03b4\u20161 \u2264 (\u03bb/2) log(PA/PB), by Eqn. (7, 8, 9), we have P(f(Y) = cA) \u2265 P(Y \u2208 A) \u2265 P(Y \u2208 B) \u2265 P(f(Y) = cB)\nThen, we would like to show that robustness can be guaranteed when R \u2264 \u2212\u03bb log(1 \u2212 PA + PB).\nFrom Eqn. (9), we know that P(Y \u2208 B) \u2264 exp(\u2016\u03b4\u20161/\u03bb) PB. Besides, by applying Eqn. (9) to the set A^c, we can get that P(Y \u2208 A) \u2265 1 \u2212 exp(\u2016\u03b4\u20161/\u03bb)(1 \u2212 PA). 
So we can calculate that if \u2016\u03b4\u20161 \u2264 \u2212\u03bb log(1 \u2212 PA + PB), we have\nP(f(Y) = cA) \u2265 P(Y \u2208 A) \u2265 P(Y \u2208 B) \u2265 P(f(Y) = cB)\nMoreover, by simple algebraic operation, we can derive that \u2212\u03bb log(1 \u2212 PA + PB) \u2265 (\u03bb/2) log(PA/PB) requires [1 \u2212 2PA(1 \u2212 PA) \u2212 \u221a(1 \u2212 4PA(1 \u2212 PA))]/(2PA) \u2264 PB \u2264 [1 \u2212 2PA(1 \u2212 PA) + \u221a(1 \u2212 4PA(1 \u2212 PA))]/(2PA).\nThe proof for Theorem 1 is finished. C PROOF OF THEOREM 2 AND THEOREM 3\nTheorem 2 (restated) (binary case) Let f : Rd \u2192 Y be a deterministic or random function, and let \u03b5 \u223c L(\u03bb). Let g(x) = arg max_c P(f(x + \u03b5) = c). Suppose there are only two classes cA and cB, and PA \u2208 [1/2, 1] s.t. P(f(x + \u03b5) = cA) \u2265 PA. Then g(x + \u03b4) = g(x), \u2200\u2016\u03b4\u20161 \u2264 R, for\nR = \u2212\u03bb log[2(1 \u2212 PA)] (10)\nProof of Theorem 2:\nIt is similar to the proof of Theorem 1. Pick \u03b23 such that there exists B\u2032 \u2286 {z : T(z) = \u03b23}, and\nP(X \u2208 {z : T(z) < \u03b23} \u222a B\u2032) = PB = P(f(X) = cB)\nDefine\nS := {z : T(z) < \u03b23} \u222a B\u2032\nSo we also have P(X \u2209 S) = PA = P(f(X) = cA). Plugging into Lemma 3.1, we get\nP(Y \u2209 S) \u2264 P(f(Y) = cA)\nP(Y \u2208 S) \u2265 P(f(Y) = cB)\nUsing a similar method as Eqn. (9), we can get that\nP(Y \u2208 S) \u2264 exp(\u2016\u03b4\u20161/\u03bb) PB\nSince we have PB = P(f(X) = cB) = 1 \u2212 P(f(X) = cA) \u2264 1 \u2212 PA\nThus, if \u2016\u03b4\u20161 \u2264 R = \u2212\u03bb log[2(1 \u2212 PA)], it holds that\nP(Y \u2208 S) \u2264 exp(\u2016\u03b4\u20161/\u03bb) PB \u2264 exp(\u2212\u03bb log[2(1 \u2212 PA)]/\u03bb)(1 \u2212 PA) = 1/2\nThat is to say, P(f(Y) = cA) \u2265 P(Y \u2209 S) \u2265 1/2 \u2265 P(Y \u2208 S) \u2265 P(f(Y) = cB). The proof for Theorem 2 is finished.\nTheorem 3 (restated) (tight bound in binary case) In the same setting as Theorem 2, assume PA + PB \u2264 1 and PA \u2265 1/2. 
\u2200R\u2032 > \u2212\u03bb log[2(1 \u2212 PA)], \u2203 base classifier f\u2217 and perturbation \u03b4\u2217 with g\u2217(x) = arg max_c P(f\u2217(x + \u03b5) = c) and \u2016\u03b4\u20161 = R\u2032, s.t. g\u2217(x) \u2260 g\u2217(x + \u03b4\u2217).\nProof of Theorem 3: Here, we first set \u03b4 = (\u2016\u03b4\u20161, 0, . . . , 0). For simplification, we denote \u03b4 = \u2016\u03b4\u20161, and define\nA := { z : |z \u2212 \u03b4| \u2212 |z| \u2265 max{\u03b4 + 2\u03bb log[2(1 \u2212 PA)], \u2212\u03b4} }\nThen, we can calculate that\nP(X \u2208 A) = Px(|x \u2212 \u03b4| \u2212 |x| \u2265 max{\u03b4 + 2\u03bb log[2(1 \u2212 PA)], \u2212\u03b4})\n= \u222b_{\u2212\u221e}^{\u2212\u03bb log[2(1 \u2212 PA)]} (1/(2\u03bb)) exp(\u2212|x|/\u03bb) dx\n= 1 \u2212 \u222b_{\u2212\u03bb log[2(1 \u2212 PA)]}^{\u221e} (1/(2\u03bb)) exp(\u2212x/\u03bb) dx\n= PA (11)\nwhere x \u223c (1/(2\u03bb)) exp(\u2212|x|/\u03bb), \u03b4 = \u2016\u03b4\u20161. Notice that if \u03b4 + 2\u03bb log[2(1 \u2212 PA)] \u2264 \u2212\u03b4, we will get the integral equation by choosing S\u2032. With Eqn. (11), we have\nP(X \u2208 A) = PA \u2264 P(f(X) = cA) (12)\nThus, plugging Eqn. (12) into the results of Lemma 3.1, we have\nP(Y \u2208 A) \u2264 P(f(Y) = cA) (13)\nAlso, since Y = X + \u03b4, it can be derived that\nP(Y \u2208 A) = \u222b_{\u2212\u221e}^{\u2212\u03bb log[2(1 \u2212 PA)] \u2212 \u03b4} (1/(2\u03bb)) exp(\u2212|x|/\u03bb) dx (14)\nHere we use the consistency of X \u2208 A and Y \u2208 A. Since Y can be regarded as an offset of X, the integral limit should be translated by the same length. So, if \u2016\u03b4\u20161 = \u03b4 \u2264 \u2212\u03bb log[2(1 \u2212 PA)], by Eqn. (7) and Eqn. (14), we have\nP(f(Y) = cA) \u2265 P(Y \u2208 A) \u2265 1/2\nThis means that the result we get in the binary case is a tight bound, and the worst-case \u03b4 appears when \u03b4 = (\u03b4, 0, . . . , 0). Furthermore, if we slightly enlarge \u03b4, there would be a case in which the robustness is destroyed.\nThe proof for Theorem 3 is finished. 
D WHY LAPLACE NOISE INSTEAD OF GAUSSIAN\nIn this section, we theoretically analyze the certification capabilities of Gaussian and Laplace noises. We will show that, given the same base classifier f, the parameter \u03bb of Laplace distributions is less sensitive than the parameter \u03c3 of Gaussian distributions. Given a base classifier f, where\nf(x) = cA if |x| \u2264 1, and cB otherwise,\nand two random smoothing functions\ng1(x) = arg max_c P(f(x + \u03b5) = c), \u03b5 \u223c L(0, \u03bb),\ng2(x) = arg max_c P(f(x + \u03b5) = c), \u03b5 \u223c N(0, \u03c3^2),\nwe aim to prove that Laplace noises will better protect the original prediction than Gaussian noises. Formally, we compare their Rectified Optional Parameter Space (ROPS), defined as \u039b = {\u221a2 \u03bb : g1(x;\u03bb) = f(x)} and \u03a3 = {\u03c3 : g2(x;\u03c3) = f(x)}. Note that the rectified term \u221a2 is due to the fact that \u03c3 = \u221a2 \u03bb yields the same variance. Essentially, ROPS indicates the feasible region where the smoothing distribution does not negatively impact the base classifier, thus measuring the sensitivity of the smoothing distribution (the larger the better).\nFirst, we would like to compare their predictions at a given point (x, f(x)) = (0, cA). We have\ng1(0) = cA \u21d0\u21d2 P(f(0 + \u03b5) = cA) \u2265 1/2 \u21d0\u21d2 P(|\u03b5| \u2264 1) = 1 \u2212 exp(\u22121/\u03bb) \u2265 1/2 \u21d0\u21d2 \u03bb \u2264 1/log 2,\ng2(0) = cA \u21d0\u21d2 P(f(0 + \u03b5) = cA) \u2265 1/2 \u21d0\u21d2 P(|\u03b5| \u2264 1) = 2\u03a6(1/\u03c3) \u2212 1 \u2265 1/2 \u21d0\u21d2 \u03c3 \u2264 1/\u03a6^{\u22121}(3/4).\nSince \u221a2/log 2 > 1/\u03a6^{\u22121}(3/4), Laplace noises have a larger ROPS than Gaussian noises at the point x = 0.\nThe analysis can be further extended in two cases.\nFirst, if we have x \u2260 0, what is the corresponding ROPS that leads to the desired result (g(x) = f(x))? We show in Fig. 
10 that we will have a larger ROPS under Laplace noises.\n\nSecond, if we fix x and also fix a desired certified radius, what is the corresponding ROPS? We show in Fig. 11 that Laplace noises again have a larger ROPS.\n\nWe empirically validate this finding with ResNet110 on CIFAR-10. The resulting smoothed model has 24.8% clean accuracy under Laplace noise, and 23.7% clean accuracy under Gaussian noise (with the same variance as the Laplace noise). Here the accuracy is computed with respect to predictions of the base classifier instead of the labels (to illustrate how the smoothing impacts the predictions).", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Control regularization for reduced variance reinforcement learning , Authors: Richard Cheng Abhinav Verma... , 1905\n2.  Title: Addressing function approximation error in actorcritic methods , Authors: Scott Fujimoto Herke van Hoof... , 2018\n3.  Title: Imformation asymmetry in kl-regularized rl , Authors: Alexandre Galashov Siddhant Jayakumar... , 2019\n4.  Title: Continuous control with deep reinforcement learning , Authors: Timothy Lillicrap Jonathan Hunt... , 2016\n5.  Title: Continuous control with deep reinforcement learning , Authors: Timothy P Lillicrap Jonathan J Hunt... , 2015\n6.  Title: Td-regularized actor-critic methods , Authors: Simone Parisi Voot Tangkaratt... , 2019\n7.  Title: Trust region policy optimization , Authors: John Schulman Sergey Levine... , 2015\n8.  Title: High-dimensional continuous control using generalized advantage estimation , Authors: John Schulman Philipp Moritz... , 2016\n9.  Title: Proximal policy optimization algorithms , Authors: John Schulman Filip Wolski... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., L2 regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. 
Interestingly, we find conventional regularization techniques on the policy networks can often bring large improvements on the task performance, and the improvement is typically more significant when the task is more difficult. We also compare with the widely used entropy regularization and find L2 regularization is generally better. Our findings are further confirmed to be robust against the choice of training hyperparameters. We also study the effects of regularizing different components and find that only regularizing the policy network is typically the best option. We hope our study provides guidance for future practices in regularizing policy optimization algorithms.", "Regularization in Deep RL. Conventional regularization methods have rarely been applied in deep RL. One rare case of such use is in Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), where Batch Normalization is applied to all layers of the actor \u00b5 network and layers of the critic Q network prior to the action input, and L2 regularization is applied to the critic Q network because it tends to have overestimation bias (Fujimoto et al., 2018). Some recent studies have developed more complicated regularization approaches to continuous control tasks. Cheng et al. (2019) regularizes the stochastic action distribution \u03c0(a|s) using a suboptimal control prior. The regularization weight at a given state is adjusted based on the temporal difference (TD) error. The larger the TD error, the more the action distribution moves towards the prior. Galashov et al. (2019) introduces a default policy that receives limited information as a regularizer. The information asymmetry between the behavior policy and the default policy helps to accelerate convergence and improve performance. Parisi et al. (2019) introduces TD error regularization to penalize inaccurate value estimation and Generalized Advantage Estimation (GAE) (Schulman et al., 2016) regularization to penalize GAE variance. 
However, most of these regularizations are rather complicated (Galashov et al., 2019), specifically designed for certain algorithms (Parisi et al., 2019), or need prior information (Cheng et al., 2019). Also, these techniques consider regularizing the output of the network, while conventional regularization methods mostly directly regularize the parameters.", "The authors present that regularization methods mostly help to improve the agents' performance in terms of final scores. Specifically, they show that direct regularizations on model parameters, such as the standard case of L2 or L1 regularization, generally improve the agent performance. They also show that these regularizations, in their study, are more suitable than entropy regularization. The authors also show that, in the presence of such regularizations, the learning algorithms become less sensitive to the hyperparameters. \n\nFew comments:\n1) The paper is well written and easy to follow. I appreciate it. I found the writing of the paper has a bit of repetition. The authors might find it slightly more proper to remove some of the repetitions (e.g. section 4.2)\n2) I would suggest rephrasing the last two sentences of the second paragraph in related work: \"Also, these techniques consider ...\". Regularizing the output also regularizes the parameters, I think the authors' point was \"directly regularize\" the parameters. \n3) In the \"Entropy Regularization\" part of section 3, I guess the Hs has not been defined. \n4) Repeated \"the\" in the last paragraph of section 4.1 (despite it already incorporates the the maximization of)\n5) The authors used the term \"not converge\" multiple times. While it is hard from the plots to see whether the series converges or not, I have a strong feeling that by this term the authors mean the algorithm does not converge to a reasonable solution rather than being divergent up to a bandwidth. 
Maybe clarifying would be helpful.\n6) In section 5, the authors study the sensitivity to the hyperparameters. In this section, I had a hard time understanding the role of term 3\n\"BN and dropout hurts on-policy algorithms but can bring improvement only for the off-policy SAC algorithm.\" Does it mean that deploying BN results in a more sensitive algorithm? Or does it mean that the performance degrades (which is a different topic than section 5 is supposed to serve)?\n7) In section 7, the authors put out a hypothesis \"\nHowever, there is still generalization between samples: the agents are only trained on the limited\" but the provided empirical study might not fully be considered to be designed to test this hypothesis. In order to test this hypothesis, the author might be interested in training the models with bigger sample sizes, more training iteration, different function classes, and more fitting.\n\n3: Weak Reject"], "input": "1 INTRODUCTION\nRegularization, typically referring to methods for preventing overfitting, is a key technique in successfully training a neural network. Perhaps the most widely recognized regularization methods in deep learning are L2 regularization (also known as weight decay) and dropout (Srivastava et al., 2014). Those techniques are standard practices in supervised learning tasks from many domains. Major tasks in computer vision, e.g., image classification (He et al., 2016; Huang et al., 2017) and object detection (Ren et al., 2015; Redmon et al., 2016), all use L2 regularization as a default option. In natural language processing, for example, the Transformer model (Vaswani et al., 2017) uses dropout, and the recently popular BERT model (Devlin et al., 2018) uses L2 regularization. 
In fact, it is very rare to see state-of-the-art neural models trained without any regularization in a supervised setting.\nHowever, in deep reinforcement learning (RL), those conventional regularization methods have been largely absent or underutilized in past research, possibly because in most cases we are maximizing the return on exactly the same task as in training. In other words, there is no generalization gap from the training environment to the test environment (Cobbe et al., 2018). Moreover, researchers in deep RL focus more on high-level algorithm designs, which are more closely related to the field of reinforcement learning, and focus less on network training techniques such as regularization. For popular policy optimization algorithms like Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor Critic (SAC) (Haarnoja et al., 2018), conventional regularization methods were not considered. Even in popular codebases such as the OpenAI Baselines (Dhariwal et al., 2017), L2 regularization and dropout were not incorporated.\nInstead, the most commonly used regularization in the RL community is an \u201centropy regularization\u201d term that penalizes the high-certainty output from the policy network, to encourage more exploration during the training process and prevent the agent from overfitting to certain actions. The entropy regularization was first introduced by Williams & Peng (1991) and is now used by many contemporary algorithms (Mnih et al., 2016; Schulman et al., 2017; Teh et al., 2017; Farebrother et al., 2018).\nIn this work, we take an empirical approach to questioning the conventional wisdom of not using common regularizations. 
We study the agent\u2019s performance on the current task (the environment which the agent is trained on), rather than its generalization ability to different environments as in many recent works (Zhang et al., 2018a; Zhao et al., 2019; Farebrother et al., 2018; Cobbe et al., 2018). We specifically focus our study on policy optimization methods, which are becoming increasingly popular and have achieved top performance on various tasks. We evaluate four popular policy optimization algorithms, namely SAC, PPO, TRPO, and the synchronous version of Advantage Actor Critic (A2C), on multiple continuous control tasks. A variety of conventional regularization techniques are considered, including L2/L1 weight regularization, dropout, weight clipping (Arjovsky et al., 2017) and Batch Normalization (BN) (Ioffe & Szegedy, 2015). We compare the performance of these regularization techniques to that without regularization, as well as to entropy regularization.\nSurprisingly, even though the training and testing environments are the same, we find that many of the conventional regularization techniques, when imposed on the policy networks, can still improve performance, sometimes significantly. Among those regularizers, L2 regularization, perhaps the simplest one, tends to be the most effective for all algorithms and generally outperforms entropy regularization. L1 regularization and weight clipping can boost performance in many cases. Dropout and Batch Normalization tend to bring improvements only on off-policy algorithms. Additionally, all regularization methods tend to be more effective on more difficult tasks. We also verify our findings with a wide range of training hyperparameters and network sizes, and the results suggest that imposing proper regularization can sometimes save the effort of tuning other training hyperparameters. 
Finally, we study which part of the policy optimization system should be regularized, and conclude that generally only regularizing the policy network suffices, as imposing regularization on value networks usually does not help. Our results also show that neural network training techniques such as regularization can be as important as high-level reinforcement learning algorithms in terms of boosting performance. Our main contributions can be summarized as follows:\n\u2022 We provide the first comprehensive study of common regularization methods in policy optimization algorithms, which have been largely ignored in the RL literature.\n\u2022 We find conventional regularizations can often be very effective in improving the performance on continuous control tasks, especially on harder ones. Remarkably, the simplest L2 regularization generally performs better than the more widely used entropy regularization. BN and dropout can only help in off-policy algorithms.\n\u2022 We experiment with multiple randomly sampled training hyperparameters for each algorithm and confirm our findings still hold. The result also suggests that proper regularization can sometimes ease the hyperparameter tuning process.\n\u2022 We study which part of the network(s) should be regularized. The key lesson is to regularize the policy network but not the value network. 3 REGULARIZATION METHODS\nThere are in general two kinds of common approaches for imposing regularization: one is to discourage complex models (e.g., weight regularization, weight clipping), and the other is to inject a certain kind of noise into the activations (e.g., dropout and Batch Normalization). Here we briefly introduce the regularization methods we investigate in our experiments.\nL2 / L1 Weight Regularization. Large weights are usually believed to be a sign of overfitting to the training data, since the function they represent tends to be complex. 
One can encourage small weights by adding a loss term penalizing the norm of the weight vector. Suppose L is the original empirical loss we want to minimize. SGD updates the model on a mini-batch of training samples: \u03b8_i \u2190 \u03b8_i \u2212 \u03b7 \u00b7 \u2202L/\u2202\u03b8_i, where \u03b7 is the learning rate. When applying L2 regularization, we add an additional squared L2-norm loss term (1/2)\u03bb||\u03b8||_2^2 to the training objective. Thus the SGD step becomes \u03b8_i \u2190 \u03b8_i \u2212 \u03b7 \u00b7 \u2202L/\u2202\u03b8_i \u2212 \u03b7 \u00b7 \u03bb \u00b7 \u03b8_i. Similarly, in the case of L1 weight regularization, the additional loss term is \u03bb||\u03b8||_1, and the SGD step becomes \u03b8_i \u2190 \u03b8_i \u2212 \u03b7 \u00b7 \u2202L/\u2202\u03b8_i \u2212 \u03b7 \u00b7 \u03bb \u00b7 sign(\u03b8_i).\nWeight Clipping. Weight clipping is an extremely simple idea: after each gradient update step, each individual weight is clipped to the range [\u2212c, c], where c is a hyperparameter. This can be formally described as \u03b8_i \u2190 max(min(\u03b8_i, c), \u2212c). In Wasserstein GANs (Arjovsky et al., 2017), weight clipping is used to satisfy the constraint of Lipschitz continuity. This plays an important role in stabilizing the training of GANs (Goodfellow et al., 2014), which were notoriously hard to train and often suffered from \u201cmode collapse\u201d before. Weight clipping can also be seen as a regularizer since it drastically reduces the complexity of the model space, by preventing any weight\u2019s magnitude from being larger than c.\nDropout. Dropout (Srivastava et al., 2014) is one of the most successful regularization techniques developed specifically for neural networks. The idea is to randomly deactivate a certain percentage of neurons during training; during testing, a rescaling operation is applied to ensure the scale of the activations is the same as in training. 
One explanation for its effectiveness in reducing overfitting is that it can prevent \u201cco-adaptation\u201d of neurons. Another explanation is that dropout acts as an implicit model ensemble method, because during training a different model is sampled to fit each mini-batch of data.\nBatch Normalization. Batch Normalization (BN) (Ioffe & Szegedy, 2015) was invented to address the problem of \u201cinternal covariate shift\u201d, and it applies the following transformation: z\u0302 = (z_in \u2212 \u00b5_B) / \u221a(\u03c3_B^2 + \u03b5); z_out = \u03b3z\u0302 + \u03b2, where \u00b5_B and \u03c3_B are the mean and standard deviation of the input activations over the mini-batch B, \u03b5 is a small constant for numerical stability, and \u03b3 and \u03b2 are trainable affine transformation parameters (scale and shift), which provide the possibility of linearly transforming normalized activations back to any scale. BN turns out to be able to greatly accelerate convergence and improve accuracy. It has become a standard component, especially in convolutional networks. BN also \u201cacts as a regularizer\u201d (Ioffe & Szegedy, 2015): since the statistics \u00b5_B and \u03c3_B depend on the current batch, BN subtracts and divides by different values in each iteration. This randomness can encourage subsequent layers to be robust to such variation of input.\nEntropy Regularization. In a policy optimization framework, the policy network is used to model a conditional distribution over actions, and entropy regularization is widely used to prevent the learned policy from overfitting to one or some of the actions. More specifically, in each step, the output distribution of the policy network is encouraged to have high entropy. Policy entropy is calculated at each step as H_{s_i} = \u2212E_{a_i\u223c\u03c0(a_i|s_i)} log \u03c0(a_i|s_i), where (s_i, a_i) is the state-action pair. Then the per-sample entropy is averaged within the batch of N state-action pairs to get the regularization term LH = (1/N) \u2211_{s_i} H_{s_i}. 
A coefficient \u03bb is also needed, and \u03bbLH is added to the policy objective J(\u03b8) to be maximized during policy updates. Entropy regularization also encourages exploration due to increased stochasticity in actions, leading to better performance in the long run. 4 EXPERIMENTS\n 4.1 SETTINGS\nAlgorithms. We evaluate the six regularization methods introduced in Section 3 using four popular policy optimization algorithms, namely, A2C (Mnih et al., 2016), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018). The first three algorithms are on-policy while the last one is off-policy. For the first three algorithms, we adopt the code from OpenAI Baseline (Dhariwal et al., 2017), and for SAC, we use the official implementation (Haarnoja, 2018).\nTasks. The algorithms with different regularizations are tested on nine continuous control tasks: Hopper, Walker, HalfCheetah, Ant, Humanoid, and HumanoidStandup from the MuJoCo simulation environment (Todorov et al., 2012); Humanoid, AtlasForwardWalk, and HumanoidFlagrun from the more challenging RoboSchool (OpenAI) suite. Among the MuJoCo tasks, agents for Hopper, Walker, and HalfCheetah are easier to learn, while Ant, Humanoid, and HumanoidStandup are relatively harder (larger state-action space, more training examples). The three RoboSchool tasks are even harder than all the MuJoCo tasks as they require more timesteps to converge. To better understand how different regularization methods work on different difficulties, we roughly categorize the first three environments as \u201ceasy\u201d tasks and the last six as \u201chard\u201d tasks.\nTraining. On MuJoCo tasks, we keep all training hyperparameters unchanged as in the codebase adopted. Since hyperparameters for the RoboSchool tasks are not included in the original codebase, we briefly tune the hyperparameters for each algorithm before we apply any regularization (more details in Appendix D). 
For details on regularization strength tuning, please refer to Appendix B.\nThe results shown in this section are obtained by only regularizing the policy network, and a further study on this issue will be presented in Section 6. We run each experiment independently with five random seeds, then use the average return over the last 100 episodes as the final result. Each regularization method is evaluated independently, with other regularizations turned off. We refer to the result without any regularization methods as the baseline. For BN and dropout, we use their training mode when we update the network, and test mode when sampling trajectories.\nNote that entropy regularization is still applicable for SAC, even though it already incorporates the maximization of entropy in the reward term. In our experiments, we add the entropy regularization term to the policy optimization loss function in equation (12) of the original paper (Haarnoja et al., 2018). Meanwhile, policy network dropout is not applicable to TRPO because during policy updates, different neurons in the old and new policy networks are dropped out, causing different shifts in the old and new action distributions given the same state, which then causes the trust region constraint to be violated. In this case, the algorithm fails to perform any policy update from network initialization. 4.2 RESULTS.\nTraining curves. We plot the training curves from four environments (rows) in Figure 1, on four algorithms (columns). Figures for the remaining five environments are deferred to Appendix C. In the figure, different colors are used to denote different regularization methods, e.g., black is the baseline method. Shades are used to denote the \u00b11 standard deviation range. Notably, these conventional regularizations can frequently boost the performance across different tasks and algorithms, demonstrating that a study on regularization in deep RL is much needed. 
Interestingly, in some cases where the baseline (with the default hyperparameters in the codebase) does not converge to a reasonable solution, e.g., A2C Ant, PPO Humanoid, imposing some regularization can make the training converge to a high level. Another observation is that BN always significantly hurts the baseline for on-policy algorithms. The reason will be discussed later. For the off-policy SAC algorithm, dropout and BN sometimes bring large improvements on hard tasks like AtlasForwardWalk and RoboschoolHumanoid.\nHow often do regularizations help? To quantitatively measure the effectiveness of the regularizations on each algorithm across different tasks, we define the condition when a regularization is said to \u201cimprove\u201d upon the baseline in a certain environment. Denote the baseline mean return over five seeds on an environment as \u00b5env,b, and the mean and standard deviation of the return obtained with a certain regularization method over five seeds as \u00b5env,r and \u03c3env,r. We say the performance is \u201cimproved\u201d by the regularization if \u00b5env,r \u2212 \u03c3env,r > max(\u00b5env,b, T (env)), where T (env) is the minimum return threshold of an environment. The threshold serves to ensure the return is at least at a reasonable level. We set the threshold to be 10^5 for HumanoidStandup and 10^3 for all other tasks.\nThe result is shown in Table 1. Perhaps the most significant observation is that L2 regularization most often improves upon the baseline. The A2C algorithm is an exception, where entropy regularization is the most effective. L1 regularization behaves similarly to L2 regularization, but is outperformed by the latter. Weight clipping\u2019s usefulness is highly dependent on the algorithms and environments. Although in total it helps only 30.6% of the time, it can sometimes outperform entropy regularization by a large margin, e.g., in TRPO Humanoid and PPO Humanoid as shown in Figure 1. 
BN is not useful at all in the three on-policy algorithms (A2C, TRPO, and PPO). Dropout is not useful in A2C at all, and sometimes helps in PPO. However, BN and dropout can be useful in SAC. All regularization methods generally improve more often when they are used on harder tasks, perhaps because for easier ones the baseline is often sufficiently strong to reach a high performance.\nIt should be noted that under our definition, not \u201cimproving\u201d does not indicate the regularization is hurting the performance. If we define \u201churting\u201d as \u00b5env,r + \u03c3env,r < \u00b5env,b (the reward minimum threshold is not considered here), then the total percentage of hurting is 0.0% for L2, 2.8% for L1, 5.6% for weight clipping, 44.4% for dropout, 66.7% for BN, and 0.0% for entropy. In other words, under our parameter tuning range, L2 and entropy regularization never hurt with appropriate strengths. For BN and dropout, we also note that almost all hurting cases are in on-policy algorithms, except one case for BN in SAC. If we instead define \u201churting\u201d as \u00b5env,r < \u00b5env,b, the total percentage of hurting is 11.1% for L2, 16.7% for L1, 22.2% for weight clipping, 55.5% for dropout, 72.2% for BN, and 16.7% for entropy. In sum, all regularizations in our study very rarely hurt the performance except for BN/dropout in on-policy methods.\nRanking all regularizations. Furthermore, to better compare their relative effectiveness, we rank the performance of all the regularization methods, together with the baseline, for each algorithm and task, and present the average ranks and the standard deviation of ranks in Table 2 and Table 3. Here, the ranks of returns among different regularizers are collected for each environment (after averaging over 5 random seeds), and then the mean and standard deviations are calculated. From Table 2, we observe that, except for BN and dropout in on-policy algorithms, all regularizations on average outperform baselines. 
Again, L2 regularization is the strongest in most cases. Other similar observations can be made as in Table 1. For every algorithm, the baseline ranks lower on harder tasks than on easier ones; in total, it ranks 3.50 for easier tasks and 5.25 for harder tasks. This indicates that regularization is more effective when the tasks are harder. 5 ROBUSTNESS WITH HYPERPARAMETER CHANGES\nIn the previous section, the experiments are conducted mostly with the default hyperparameters in the codebase we adopt, which are not necessarily optimized. For example, the PPO Humanoid baseline performs poorly using default hyperparameters, not converging to a reasonable solution. Meanwhile, it is known that RL algorithms are very sensitive to hyperparameter changes (Henderson et al., 2018). Thus, our findings can be vulnerable to such variations. To further confirm our findings, we evaluate the regularizations under a variety of hyperparameter settings. For each algorithm, we sample five hyperparameter settings for the baseline and apply regularization on each of them. Due to the heavy computation budget, we only evaluate on five MuJoCo environments: Hopper, Walker, Ant, Humanoid, and HumanoidStandup. Under our sampled hyperparameters, poor baselines are mostly significantly improved. For further details on sampling and training curves, please refer to Appendix E and K.\nSimilar to Table 2 and Table 3, the results of regularization ranks are shown in Table 4 and Table 5. For results of improvement percentages similar to Table 1, please refer to Appendix F. We note that our main findings still hold: 1) the regularizations can improve more effectively on baselines on harder tasks; 2) L2 is still generally the best regularization method; 3) BN and dropout hurt on-policy algorithms and bring improvement only for the off-policy SAC algorithm. Interestingly, unlike in the previous section, L1 regularization and weight clipping tend to be more effective than the entropy regularization. 
The gaps between entropy and L2, L1 and weight clipping are even larger for harder tasks.\nTo better visualize the robustness against changes of hyperparameters, we show the result when a single hyperparameter is varied in Figure 2. We note that certain regularizations can consistently improve the baseline with different hyperparameters. In these cases, proper regularizations can ease the hyperparameter tuning process, as they can bring up the performance of baselines with suboptimal hyperparameters to be even higher than baselines with better hyperparameters.\nWe also analyze the effect of regularization with different network widths/depths in Figure 3. There are several observations we can draw: 1) The baseline performance can increase, decrease, or stay roughly the same when the network depth/width increases. 2) Certain regularizations can help with various widths or depths, demonstrating their robustness against these hyperparameters and ability to ease hyperparameter tuning. 3) Regularizations do not necessarily bring larger improvement when the networks are bigger, contrary to what we might expect, namely that larger networks suffer more from overfitting and thus benefit more from regularization. As an example, L2 sometimes helps more with a thinner network (TRPO Ant), and sometimes more with a wider network (PPO HumanoidStandup). 6 POLICY AND VALUE NETWORK REGULARIZATION\nOur experiments in previous sections only impose regularization on the policy network. To investigate the relationship between policy and value network regularization, we evaluate four options: 1) no regularization, and regularizing 2) policy network, 3) value network, 4) policy and value networks. For 2) and 3) we tune the regularization strengths independently and then use the appropriate ones for 4) (more details in Appendix B). We evaluate all four algorithms on the six MuJoCo tasks and present the percentage of tasks where we obtain improvement in Table 6. 
Note that entropy regularization is not applicable to the value network. For detailed training curves, please refer to Appendix L.\nIt can be seen that generally, only regularizing the policy network tends to be the most effective for almost all algorithms and regularizations. Regularizing the value network alone does not bring as significant a performance improvement as regularizing the policy network alone. Though regularizing both is better than regularizing the value network alone, it is still worse than only regularizing the policy network. 7 DISCUSSION AND CONCLUSION\nWhy does regularization benefit policy optimization? In RL, we are typically training and evaluating on the same environment, i.e., there is no generalization gap across different environments. However, there is still generalization between samples: the agent is only trained on the limited trajectories it has experienced, which cannot cover the whole state-action space of the environment. A successful policy needs to generalize from seen samples to unseen ones, which potentially makes regularization necessary in RL. This might also explain why regularization could be more helpful on harder tasks, which have a larger state space. In this case, the portion of the space that has appeared in training tends to be smaller, and overfitting to this smaller portion of the space would cause more serious issues, in which case regularizations may help. Some detailed analyses are provided in Appendix G.\nWhy do BN and dropout work only with off-policy algorithms? One major finding in our experiments is that BN and dropout can sometimes improve on the off-policy algorithm SAC, but mostly hurt on-policy algorithms. 
There are two possible reasons for this: 1) for both BN and dropout, training mode is used to train the network, and testing mode is used to sample actions during interaction with the environment, leading to a discrepancy between the sampling policy and optimization policy (the same holds if we always use training mode). For on-policy algorithms, if such discrepancy is large, it can cause severe off-policy issues, which hurt the optimization process or even crash it. For off-policy algorithms, this discrepancy is not an issue since they naturally accept off-policy data. 2) Batch Normalization layers can be sensitive to input distribution shifts, since the mean and std statistics depend heavily on the input; if the input distribution changes too quickly in training, the mapping functions of BN layers can change quickly too, which can destabilize training. One piece of evidence for this is that in supervised learning, when transferring an ImageNet pretrained model to other vision datasets, sometimes the BN layers are fixed (Yang et al., 2017) and only other layers are trained. In on-policy algorithms, we always use the samples generated from the latest policy; in off-policy algorithms, the sample distributions are relatively slow-changing since we always draw from the whole replay buffer, which holds cumulative data. The faster-changing input distribution for on-policy algorithms could be harmful to BN. Previously, BN has also been shown to be effective in Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), an off-policy algorithm.\nIn summary, we conducted the first comprehensive study of regularization methods with multiple policy optimization algorithms on continuous control benchmarks. We found that L2 regularization, despite being largely ignored in prior literature, is effective in improving performance, even more than the widely used entropy regularization. BN and dropout could also be useful but only on off-policy algorithms. 
Our findings were confirmed with multiple hyperparameters. Further experiments have shown that generally the best practice is to regularize the policy network alone but not the value network or both. A POLICY OPTIMIZATION ALGORITHMS\nThe policy optimization family of algorithms is one of the most popular methods for solving reinforcement learning problems. It directly parameterizes and optimizes the policy to gain more cumulative rewards. Below, we give a brief introduction to the algorithms we evaluate in our work.\nA2C. Sutton et al. (2000) developed a policy gradient to update the parametric policy in a gradient descent manner. However, the gradient estimated in this way suffers from high variance. Asynchronous Advantage Actor Critic (A3C) (Mnih et al., 2016) was proposed to alleviate this problem by introducing a function approximator for values and replacing the Q-values with advantage values. A3C also utilizes multiple actors to parallelize training. The only difference between A2C and A3C is that in a single training iteration, A2C waits for parallel actors to finish sampling trajectories before updating the neural network parameters, while A3C updates in an asynchronous manner.\nTRPO. Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) proposes to constrain each update within a safe region defined by KL divergence to guarantee policy improvement during training. Though TRPO is promising at obtaining reliable performance, approximating the KL constraint is quite computationally heavy.\nPPO. Proximal Policy Optimization (PPO) (Schulman et al., 2017) simplifies TRPO and improves computational efficiency by developing a surrogate objective that involves clipping the probability ratio to a reliable region, so that the objective can be optimized using first-order methods.\nSAC. 
Soft Actor Critic (SAC) (Haarnoja et al., 2018) optimizes the maximum entropy objective in reward (Ziebart et al., 2008), which is different from the objective of the on-policy methods above. SAC combines soft policy iteration, which maximizes the maximum entropy objective, and clipped double Q learning (Fujimoto et al., 2018), which prevents overestimation bias, during actor and critic updates, respectively.\nB IMPLEMENTATION AND TUNING FOR REGULARIZATION METHODS\nAs mentioned in the paper, in Section 4 we only regularize the policy network; in Section 6, we investigate regularizing both policy and value networks.\nFor L1 and L2 regularization, we add \u03bb||\u03b8||_1 and (\u03bb/2)||\u03b8||_2^2, respectively, to the loss of the policy network or value network of each algorithm (for SAC\u2019s value regularization, we apply regularization only to the V network and not to the two Q networks). L1 and L2 losses are applied to all the weights of the policy or value network. For A2C, TRPO, and PPO, we tune \u03bb in the range of [1e\u22125, 5e\u22125, 1e\u22124, 5e\u22124] for L1 and [5e\u22125, 1e\u22124, 5e\u22124, 1e\u22123] for L2. For SAC, we tune \u03bb in the range of [5e\u22124, 1e\u22123, 5e\u22123, 1e\u22122] for L1 and [1e\u22123, 5e\u22123, 1e\u22122, 5e\u22122] for L2. For weight clipping, the OpenAI Baseline implementation of the policy network of A2C, TRPO, and PPO outputs the mean of the policy action from a two-layer fully connected network (MLP). The log standard deviation of the policy action is represented by a standalone trainable vector. We find that when applied only to the weights of the MLP, weight clipping makes the performance much better than when applied to only the logstd vector or both. Thus, for these three algorithms, the policy network weight clipping results shown in all the sections above come from clipping only the MLP part of the policy network. 
On the other hand, in the SAC implementation, both the mean and the log standard deviation come from the same MLP, and there is no standalone log standard deviation vector. Thus, we apply weight clipping to all the weights of the MLP. For all algorithms, we tune the policy network clipping range in [0.1, 0.2, 0.3, 0.5]. For the value network, the MLP produces a single output of estimated value given a state, so we clip all the weights of the MLP. For A2C, TRPO, and PPO, we tune the clipping range in [0.1, 0.2, 0.3, 0.5]. For SAC, we only clip the V network and do not clip the two Q networks for simplicity. We tune the clipping range in [0.3, 0.5, 0.8, 1.0] due to its weights having larger magnitude.\nFor Batch Normalization/dropout, we apply it before the activation function of each hidden layer/immediately after the activation function. When the policy or the value network is updated using minibatches of trajectory data or minibatches of replay buffer data, we use the train mode of regularization and update the running mean and standard deviation. When the policy is sampling trajectories from the environment, we use the test mode of regularization and use the existing running mean and standard deviation to normalize data. For Batch Normalization/dropout on the value network, only training mode is applied since the value network does not participate in sampling trajectories. Note that adding policy network dropout on TRPO causes the KL divergence constraint E_{s\u223c\u03c1_{\u03b8old}} [D_KL(\u03c0_{\u03b8old}(\u00b7|s) \u2016 \u03c0_\u03b8(\u00b7|s))] \u2264 \u03b4 to be violated almost every time during policy network update. Thus, policy network dropout causes the training to fail on TRPO, as the policy network cannot be updated.\nFor entropy regularization, we add \u2212\u03bbLH to the policy loss. \u03bb is tuned from [5e\u22125, 1e\u22124, 5e\u22124, 1e\u22123] for A2C, TRPO, PPO and [0.1, 0.5, 1.0, 5.0] for SAC. 
Note that for SAC, our entropy regularization is added directly on the optimization objective (equation 12 in Haarnoja et al. (2018)), and is different from the original maximum entropy objective inside the reward term.\nThe optimal policy network regularization strength we selected for each algorithm and environment used in Section 4 can be seen in the legends of Appendix L. In addition to the environment-specific-strength regularization results presented in Section 4, we also present the results when the regularization strength is fixed across all environments for the same algorithm. The results are shown in Appendix H.\nIn Section 6, to investigate the effect of regularizing both policy and value networks, we combine the tuned optimal policy and value network regularization strengths. The detailed training curves are presented in Appendix L.\nAs a side note, when training A2C, TRPO, and PPO on the HalfCheetah environment, the results have very large variance. Thus, for each regularization method, after we obtain the best strength, we rerun it for another five seeds as the final result in Tables 1 and 2. C ADDITIONAL TRAINING CURVES\nAs a complement to Figure 1 in Section 4, we plot the training curves of the other five environments in Figure 4. D DEFAULT HYPERPARAMETER SETTINGS FOR BASELINES\nTraining timesteps. For A2C, TRPO, and PPO, we run 5e6 timesteps for Hopper, Walker, and HalfCheetah; 2e7 timesteps for Ant, Humanoid (MuJoCo), and HumanoidStandup; 5e7 timesteps for Humanoid (RoboSchool); and 1e8 timesteps for AtlasForwardWalk and HumanoidFlagrun. 
For SAC, since its simulation speed is much slower than A2C, TRPO, and PPO (as SAC updates its policy and value networks using a minibatch of replay buffer data at every timestep), and since it takes fewer timesteps to converge, we run 1e6 timesteps for Hopper and Walker; 3e6 timesteps for HalfCheetah and Ant; 5e6 timesteps for Humanoid and HumanoidStandup; and 1e7 timesteps for the RoboSchool environments.\nHyperparameters for RoboSchool. In the original PPO paper (Schulman et al., 2017), hyperparameters for the RoboSchool tasks are given, so we apply the same hyperparameters to our training, except that instead of linearly annealing the log standard deviation of the action distribution from \u22120.7 to \u22121.6, we let it be learned by the algorithm, as implemented in OpenAI Baseline (Dhariwal et al., 2017). For TRPO, due to its proximity to PPO, we copy PPO\u2019s hyperparameters if they exist in both algorithms. We then tune the value update step size in [3e\u22124, 5e\u22124, 1e\u22123]. For A2C, we keep the original hyperparameters and tune the number of actors in [32, 128] and the number of timesteps for each actor between consecutive policy updates in [5, 16, 32]. For SAC, we tune the reward scale from [5, 20, 100].\nThe detailed hyperparameters used in our baselines for both MuJoCo and RoboSchool are listed in Tables 7-10. E HYPERPARAMETER SAMPLING DETAILS\nIn Section 5, we present results based on five hyperparameter settings. To obtain such hyperparameter variations, we consider varying the learning rates and the hyperparameters that each algorithm is very sensitive to. For A2C, TRPO, and PPO, we consider a range of rollout timesteps between consecutive policy updates by varying the number of actors or the number of trajectory sampling timesteps for each actor. 
For SAC, we consider a range of reward scales and a range of target smoothing coefficients.\nMore concretely, for A2C, we sample the learning rate from [2e\u22124, 7e\u22124, 2e\u22123] with linear decay, the number of trajectory sampling timesteps (nsteps) for each actor from [3, 5, 16, 32], and the number of actors (nenvs) from [1, 4]. For TRPO, we sample the learning rate of the value network (vf_stepsize) from [3e\u22124, 5e\u22124, 1e\u22123] and the number of trajectory sampling timesteps for each actor (nsteps) from [1024, 2048, 4096, 8192]. The policy update uses conjugate gradient descent and is controlled by the max KL divergence. For PPO, we sample the learning rate from [1e\u22124 linear, 3e\u22124 constant], the number of actors (nenvs) from [1, 2, 4, 8], and the probability ratio clipping range (cliprange) from [0.1, 0.2]. For SAC, we sample the learning rate from [1e\u22124, 3e\u22124, 1e\u22123], the target smoothing coefficient (\u03c4) from [0.001, 0.005, 0.01], and the reward scale from small, default, and large modes.\nThe default reward scale of 5 is changed to (3, 5, 20); 20 to (4, 20, 100); 100 to (20, 100, 400) for each mode, respectively. Sampled hyperparameters 1-5 for each algorithm are listed in Tables 11a-11d.\nF HYPERPARAMETER EXPERIMENT IMPROVEMENT PERCENTAGE\nWe provide the percentage of improvement result in Table 12 as a complement to Table 4.\nG GENERALIZATION ANALYSIS\nWe provide some experiments to accompany the qualitative analysis on generalization in Section 7.\nWe investigate the agent\u2019s obtained rewards on a set of sampled trajectories. We train PPO Humanoid and TRPO Ant models, then evaluate the reward on 100 trajectories and plot the reward distribution in Figures 5 and 6. These trajectories are unseen samples during training, since the state space is continuous. The trajectory reward distributions are plotted in the figures. 
For baseline, some of the trajectories yield relatively high rewards, while others yield low rewards, demonstrating the baseline cannot stably generalize to unseen examples; for regularized models, the rewards are mostly high and have smaller variance, demonstrating they can more stably generalize to unseen samples. This suggests that conventional regularization can improve the model\u2019s generalization ability to larger portion of unseen samples.\nNext, we present the results of varying the number of training samples/timesteps in Figure 7. We find that for regularized models to reach the same level of reward as baseline, they only need much fewer samples in training. Note that the reward is also on unseen samples/trajectories. In addition, regularization\u2019s gain over baseline can be larger when the samples are fewer (SAC Ant, TRPO Ant). Since the ability to learn from fewer samples is closely related to the notion of generalization, we can conclude that regularized models have better generalization ability than baselines. H REGULARIZATION WITH A SINGLE STRENGTH\nIn previous sections, we tune the strength of regularization for each algorithm and environment, as described in Appendix B. Now we restrict the regularization methods to a single strength for each algorithm, across different environments. The results are shown in Table 13, 14, and 15. The selected strength are presented in Table 16. We see that the L2 regularization is still generally the best performing one, but SAC is an exception, where BN is better. This can be explained by the fact that in SAC, the reward scaling coefficient is different for each environment, which potentially causes the optimal L2 and L1 strength to vary a lot across different environments, while BN does not have a strength parameter. 
L2 0.94 0.50 0.87 0.47 1.07 1.15 1.63 0.69 1.17 0.82 1.70 1.47 1.26 1.21 1.23\n L1 0.94 0.75 0.87 0.47 1.07 0.94 1.70 0.94 1.29 0.94 1.21 1.26 1.26 1.43 1.38\n I REGULARIZING WITH BOTH L2 AND ENTROPY\nWe also investigate the effect of combining L2 regularization with entropy regularization, given that applying either one alone yields a performance improvement. We apply L2 regularization and entropy regularization together, each at its optimal strength, and compare with applying L2 regularization or entropy regularization alone. We find that the performance increases for PPO HumanoidStandup, stays approximately the same for TRPO Ant, and decreases for A2C HumanoidStandup. Thus, the regularization benefits are not always additive. This phenomenon is possibly caused by the fact that the algorithms already achieve good performance using only L2 regularization or entropy regularization, and further performance improvement is restrained by the intrinsic capabilities of the algorithms.\nJ COMPARING L2 REGULARIZATION WITH FIXED WEIGHT DECAY (ADAMW)\nFor the Adam optimizer (Kingma & Ba, 2015), \u201cfixed weight decay\u201d (AdamW in Loshchilov & Hutter (2019)) differs from L2 regularization in that the gradient of $\\frac{1}{2}\\lambda\\|\\theta\\|^2$ is not computed with the gradient of the original loss, but the weight is \u201cdecayed\u201d finally with the gradient update. For Adam these two procedures are very different (see Loshchilov & Hutter (2019) for more details). In this section, we compare the effect of adding L2 regularization with that of using AdamW, with PPO on Humanoid and HumanoidStandup. The result is shown in Figure 9. Similar to L2, we briefly tune the strength of weight decay in AdamW and the optimal one is used. We find that while both L2 regularization and AdamW can significantly improve the performance over baseline, the performance of AdamW tends to be slightly lower than the performance of L2 regularization. 
K TRAINING CURVES FOR HYPERPARAMETER EXPERIMENTS\nIn this section, we plot the full training curves of the experiments in Section 5 with five sampled hyperparameter settings for each algorithm from Figure 10 to Figure 13. The strength of each regularization is tuned according to the range in Appendix B. L TRAINING CURVES FOR POLICY VS. VALUE EXPERIMENTS\nWe plot the training curves with our study in Section 6 on policy and value network regularizations from Figure 14 to Figure 17.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling , Authors: Shaojie Bai J Zico Kolter... , 2018\n2.  Title: Learning phrase representations using rnn encoder-decoder for statistical machine translation , Authors: Kyunghyun Cho Bart Van Merri\u00ebnboer... , 2014\n3.  Title: Language modeling with gated convolutional networks , Authors: Yann N Dauphin Angela Fan... , 2017\n4.  Title: A convolutional encoder model for neural machine translation , Authors: Jonas Gehring Michael Auli... , 2016\n5.  Title: Convolutional sequence to sequence learning , Authors: Jonas Gehring Michael Auli... , 2017\n6.  Title: Wavenet: A generative model for raw audio , Authors: A\u00e4ron Van Den Oord Sander Dieleman... , 2016\n7.  Title: Attention is all you need , Authors: Ashish Vaswani Noam Shazeer... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, it severely suffers from two issues: impotent in capturing very long-term dependencies and unable to parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack necessary components to model local structures in sequences and heavily rely on position embeddings that have limited effects and require a considerable amount of design efforts. 
In this paper, we propose the R-Transformer, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code and data publicly available 1.", "Recurrent Neural Networks, including variants such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014), have long been the default choices for generic sequence modeling. An RNN sequentially processes each position in a sequence and maintains an internal hidden state to compress information of positions that have been seen. While its design is appealing and it has been successfully applied in various tasks, several problems caused by its recursive structure, including low computation efficiency and exploding or vanishing gradients, make it ineffective when learning long sequences. Therefore, in recent years, a lot of effort has been made to develop models without recursive structures, and they can be roughly divided into two categories depending on whether they rely on convolution operations or not. The first category includes models that are mainly built on convolution operations. For example, van den Oord et al. have designed an autoregressive WaveNet that is based on causal filters and dilated convolution to capture both global and local information in raw audio (Van Den Oord et al., 2016). Gehring et al. have successfully replaced the traditional RNN-based encoder and decoder with convolutional ones and outperformed the LSTM setup in neural machine translation tasks (Gehring et al., 2017; 2016). 
Moreover, researchers introduced gate mechanism into convolutions structures to model sequential dependencies in languages (Dauphin et al., 2017). Most recently, a generic architecture for sequence modeling, termed as Temporal Convolutional Networks (TCN), that combines components from previous works has been proposed in (Bai et al., 2018). Authors in (Bai et al., 2018) have systematically compared TCN with canonical recurrent networks in a wide range of tasks and TCN is able achieve better performance in most cases. Our R-transformer is motivated by works in this group in a sense that we firstly models local information and then focus on global ones. The most popular works in second category are those based on multi-head attention mechanism. The multi-head attention mechanism was firstly proposed in Vaswani et al.", "The authors propose an architecture that combines the practices of recurrent and feed-forward sequence models. However, I have major concerns regarding the novelty this paper, the various claims it makes, as well as its experiment setting.\n\nMajor issues/questions:\n1. The techniques proposed by this paper lack novelty. For instance, the entire section 3.2 is simply the original design of the multi-head self-attention by Vaswani et al. The major difference between R-Transformer and the original Transformer is the replacement of positional embedding with an RNN layer, but (in my opinion) the authors did not demonstrate sufficiently its effectiveness via ablative studies (see below). Moreover, some prior works have already exploited the locality structure in Transformers. For instance, [1] showed that a sparse, local Transformer can work extremely well and be very efficient (they achieved SOTA on large-scale char-level language modeling tasks).\n\n2. The experiments do not entirely convince me. 
\n    i) The authors use the \"same hidden size for R-Transformer and Transformer.\" But in fact, as the R-Transformer has one extra RNN/LSTM/GRU layer at every level of the network, the tests were carried out (in effect) using a larger model than the baselines. I think the authors should instead control the # of model parameters, especially since you are running only on small tasks with small-sized models.\n    ii) It is nice that the authors tested R-Transformer on a variety of tasks--- this is important. However, in no way do these number achieve the levels of the \"state-of-the-art\", which the authors claim at the end of Section 1 (e.g., [2] has better number on seqMNIST and character-level PTB, and the Transformer-XL actually achieves <55 perplexity on word-level PTB). Therefore, the numbers don't look particularly appealing to me. \n    iii) Lack of more challenging, or large-scale experiments.\n    iv) Lack of ablative study. \n\n3. The authors claim that the finite-window RNN captures local features. But doesn't that claim only applies to the first layer? Once the first layer multi-head attention mixes all input elements across the sequence, the \"local\" features fed into the second layer RNN will be, actually, **global** features? Doesn't that \"defeat\" the purpose of using a local RNN though?\n\n3: Weak Reject"], "input": "1 INTRODUCTION\nRecurrent Neural Networks (RNNs) especially its variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have achieved great success in a wide range of sequence learning tasks including language modeling, speech recognition, recommendation, etc (Mikolov et al., 2010; Sundermeyer et al., 2012; Graves & Jaitly, 2014; Hinton et al., 2012; Hidasi et al., 2015). Despite their success, however, the recurrent structure is often troubled by two notorious issues. 
First, it easily suffers from gradient vanishing and exploding problems, which largely limits their ability to learn very long-term dependencies (Pascanu et al., 2013). Second, the sequential nature of both forward and backward passes makes it extremely difficult, if not impossible, to parallelize the computation, which dramatically increases the time complexity in both training and testing procedure. Therefore, many recently developed sequence learning models have completely jettisoned the recurrent structure and only rely on convolution operation or attention mechanism that are easy to parallelize and allow the information flow at an arbitrary length. Two representative models that have drawn great attention are Temporal Convolution Networks(TCN) (Bai et al., 2018) and Transformer (Vaswani et al., 2017). In a variety of sequence learning tasks, they have demonstrated comparable or even better performance than that of RNNs (Gehring et al., 2017; Bai et al., 2018; Devlin et al., 2018).\nThe remarkable performance achieved by such models largely comes from their ability to capture long-term dependencies in sequences. In particular, the multi-head attention mechanism in Transformer allows every position to be directly connected to any other positions in a sequence. Thus, the information can flow across positions without any intermediate loss. Nevertheless, there are two issues that can harm the effectiveness of multi-head attention mechanism for sequence learning. The first comes from the loss of sequential information of positions as it treats every position identically. To mitigate this problem, Transformer introduces position embeddings, whose effects,\n1https://www.dropbox.com/sh/u35qgqnmjpywcqn/AAAITcId7DRPOD9KRooQW7i2a?dl=0\nhowever, have been shown to be limited (Dehghani et al., 2018; Al-Rfou et al., 2018). 
In addition, it requires considerable amount of efforts to design more effective position embeddings or different ways to incorporate them in the learning process (Dai et al., 2019). Second, while multi-head attention mechanism is able to learn the global dependencies, we argue that it ignores the local structures that are inherently important in sequences such as natural languages. Even with the help of position embeddings, the signals at local positions can still be very weak as the number of other positions is significantly more.\nTo address the aforementioned limitations of the standard Transformer, in this paper, we propose a novel sequence learning model, termed as R-Transformer. It is a multi-layer architecture built on RNNs and the standard Transformer, and enjoys the advantages of both worlds while naturally avoids their respective drawbacks. More specifically, before computing global dependencies of positions with the multi-head attention mechanism, we firstly refine the representation of each position such that the sequential and local information within its neighborhood can be compressed in the representation. To do this, we introduce a local recurrent neural network, referred to as LocalRNN, to process signals within a local window ending at a given position. In addition, the LocalRNN operates on local windows of all the positions identically and independently and produces a latent representation for each of them. In this way, the locality in the sequence is explicitly captured. In addition, as the local window is sliding along the sequence one position by one position, the global sequential information is also incorporated. More importantly, because the localRNN is only applied to local windows, the aforementioned two drawbacks of RNNs can be naturally mitigated. 
We evaluate the effectiveness of R-Transformer on a variety of sequence learning tasks from different domains and the empirical results demonstrate that R-Transformer achieves much stronger performance than both TCN and standard Transformer as well as other state-of-the-art sequence models.\nThe rest of the paper is organized as follows: Section 2 discusses the sequence modeling problem we aim to solve; the proposed R-Transformer model is presented in Section 3. In Section 4, we describe the experimental details and discuss the results. The related work is briefly reviewed in Section 5. Section 6 concludes this work.\n2 SEQUENCE MODELING PROBLEM\nBefore introducing the proposed R-Transformer model, we formally describe the sequence modeling problem. Given a sequence of length $N$: $x_1, x_2, \\cdots, x_N$, we aim to learn a function that maps the input sequence into a label space $\\mathcal{Y}$: ($f : \\mathcal{X}^N \\rightarrow \\mathcal{Y}$). Formally,\n$y = f(x_1, x_2, \\cdots, x_N)$ (1)\nwhere $y \\in \\mathcal{Y}$ is the label of the input sequence. Depending on the definition of label $y$, many tasks can be formulated as the sequence modeling problem defined above. For example, in the language modeling task, $x_t$ is the character/word in a textual sentence and $y$ is the character/word at the next position (Mikolov et al., 2010); in session-based recommendation, $x_t$ is the user-item interaction in a session and $y$ is the future item that users will interact with (Hidasi et al., 2015); when $x_t$ is a nucleotide in a DNA sequence and $y$ is its function, this problem becomes a DNA function prediction task (Quang & Xie, 2016). Note that, in this paper, we do not consider sequence-to-sequence learning problems. However, the proposed model can be easily extended to solve these problems and we leave this as future work.\n3 THE R-TRANSFORMER MODEL\nThe proposed R-Transformer consists of a stack of identical layers. 
Each layer has 3 components that are organized hierarchically and the architecture of the layer structure is shown in Figure 1. As shown in the figure, the lower level is the local recurrent neural network that is designed to model local structures in a sequence; the middle level is a multi-head attention that is able to capture global long-term dependencies; and the upper level is a position-wise feedforward network which conducts a non-linear feature transformation. Next, we describe each level in detail.\n3.1 LOCALRNN: MODELING LOCAL STRUCTURES\nSequential data such as natural language inherently exhibits strong local structures. Thus, it is desirable and necessary to design components to model such locality. In this subsection, we propose to take advantage of RNNs to achieve this. Unlike previous works where RNNs are often applied to the whole sequence, we instead reorganize the original long sequence into many short sequences which only contain local information and are processed by a shared RNN independently and identically. In particular, we construct a local window of size M for each target position such that the local window includes M consecutive positions and ends at the target position. Thus, positions in each local window form a local short sequence, from which the shared RNN will learn a latent representation. In this way, the local structure information of each local region of the sequence is explicitly incorporated in the learned latent representations. We refer to the shared RNN as LocalRNN. Compared to the original RNN operation, LocalRNN only focuses on local short-term dependencies without considering any long-term dependencies. Figure 2 shows the difference between the original RNN and LocalRNN operations. 
Concretely, given the positions $x_{t-M+1}, x_{t-M+2}, \\cdots, x_t$ of a local short sequence of length $M$, the LocalRNN processes them sequentially and outputs $M$ hidden states, the last of which is used as the representation of the local short sequence:\n$h_t = \\mathrm{LocalRNN}(x_{t-M+1}, x_{t-M+2}, \\cdots, x_t)$ (2)\nwhere LocalRNN may use any RNN cell, such as a vanilla RNN cell, LSTM, or GRU. To enable the model to process the sequence in an auto-regressive manner and ensure that no future information is available when processing one position, we pad the input sequence by $(M - 1)$ positions before the start of a sequence. Thus, from the sequence perspective, the LocalRNN takes an input sequence and outputs a sequence of hidden representations that incorporate information of local regions:\n$h_1, h_2, \\cdots, h_N = \\mathrm{LocalRNN}(x_1, x_2, \\cdots, x_N)$ (3)\nThe LocalRNN is analogous to 1-D Convolutional Neural Networks, where each local window is processed by convolution operations. However, the convolution operation completely ignores the sequential information of positions within the local window. Although position embeddings have been proposed to mitigate this problem, a major deficiency of this approach is that the effectiveness of the position embedding could be limited; thus it requires a considerable amount of extra effort (Gehring et al., 2017). On the other hand, the LocalRNN is able to fully capture the sequential information within each window. In addition, the one-by-one sliding operation also naturally incorporates the global sequential information.\nDiscussion: RNNs have long been a dominant choice for sequence modeling, but they suffer severely from two problems: the first is their limited ability to capture long-term dependencies and the second is the time complexity, which is linear in the sequence length. However, in LocalRNN, these problems are naturally mitigated. 
This is because the LocalRNN is applied to a short sequence within a local window of fixed size, where no long-term dependency needs to be captured. In addition, the computation procedures for processing the short sequences are independent of each other. Therefore, a parallel implementation (e.g., using GPUs) is straightforward, which can greatly improve the computation efficiency.\n3.2 CAPTURING THE GLOBAL LONG-TERM DEPENDENCIES WITH MULTI-HEAD ATTENTION\nThe RNNs at the lower level introduced in the previous subsection refine the representation of each position so that it incorporates its local information. In this subsection, we build a sub-layer on top of the LocalRNN to capture the global long-term dependencies. We term it the pooling sub-layer because it functions similarly to the pooling operation in CNNs. Recent works have shown that the multi-head attention mechanism is extremely effective at learning long-term dependencies, as it allows a direct connection between every pair of positions. More specifically, in the multi-head attention mechanism, each position attends to all the positions in the past and obtains a set of attention scores that are used to refine its representation. Mathematically, given current representations $h_1, h_2, \\cdots, h_t$, the refined new representations $u_t$ are calculated as:\n$u_t = \\mathrm{MultiHeadAttention}(h_1, h_2, \\cdots, h_t) = \\mathrm{Concatenation}(\\mathrm{head}_1(h_t), \\mathrm{head}_2(h_t), \\cdots, \\mathrm{head}_k(h_t)) W^o$ (4)\nwhere $\\mathrm{head}_k(h_t)$ is the result of the $k$-th attention pooling and $W^o$ is a linear projection matrix. Considering both efficiency and effectiveness, the scaled dot product is used as the attention function (Vaswani et al., 2017). 
Specifically, $\\mathrm{head}_i(h_t)$ is the weighted sum of all value vectors and the weights are calculated by applying the attention function to all the query-key pairs:\n$\\{\\alpha_1, \\alpha_2, \\cdots, \\alpha_n\\} = \\mathrm{Softmax}\\left(\\left\\{\\frac{\\langle q, k_1 \\rangle}{\\sqrt{d_k}}, \\frac{\\langle q, k_2 \\rangle}{\\sqrt{d_k}}, \\cdots, \\frac{\\langle q, k_n \\rangle}{\\sqrt{d_k}}\\right\\}\\right)$ (5)\n$\\mathrm{head}_i(h_t) = \\sum_{j=1}^{n} \\alpha_j v_j$\nwhere $q$, $k_i$, and $v_i$ are the query, key, and value vectors and $d_k$ is the dimension of $k_i$. Moreover, $q$, $k_i$, and $v_i$ are obtained by projecting the input vectors into query, key and value spaces, respectively (Vaswani et al., 2017). They are formally defined as:\n$q, k_i, v_i = W^q h_t, W^k h_i, W^v h_i$ (6)\nwhere $W^q$, $W^k$ and $W^v$ are the projection matrices and each attention pooling $\\mathrm{head}_i$ has its own projection matrices. As shown in Eq. (5), each $\\mathrm{head}_i$ is obtained by letting $h_t$ attend to all the \u201cpast\u201d positions, thus any long-term dependencies between $h_t$ and $h_i$ can be captured. In addition, different heads will focus on dependencies in different aspects. After obtaining the refined representation of each position by the multi-head attention mechanism, we add a position-wise fully connected feed-forward network sub-layer, which is applied to each position independently and identically. This feedforward network transforms the features non-linearly and is defined as follows:\n$\\mathrm{FeedForward}(u_t) = \\max(0, u_t W_1 + b_1) W_2 + b_2$ (7)\nFollowing Vaswani et al. (2017), we add a residual (He et al., 2016) and layernorm (Ba et al., 2016) connection between all the sub-layers.\n3.3 OVERALL ARCHITECTURE OF R-TRANSFORMER\nWith all the aforementioned model components, we can now give a formal description of the overall architecture of an $N$-layer R-Transformer. 
For the $i$-th layer ($i \\in \\{1, 2, \\cdots, N\\}$):\n$h^i_1, h^i_2, \\cdots, h^i_T = \\mathrm{LocalRNN}(x^i_1, x^i_2, \\cdots, x^i_T)$\n$\\hat{h}^i_1, \\hat{h}^i_2, \\cdots, \\hat{h}^i_T = \\mathrm{LayerNorm}(h^i_1 + x^i_1, h^i_2 + x^i_2, \\cdots, h^i_T + x^i_T)$\n$u^i_1, u^i_2, \\cdots, u^i_T = \\mathrm{MultiHeadAttention}(\\hat{h}^i_1, \\hat{h}^i_2, \\cdots, \\hat{h}^i_T)$\n$\\hat{u}^i_1, \\hat{u}^i_2, \\cdots, \\hat{u}^i_T = \\mathrm{LayerNorm}(u^i_1 + \\hat{h}^i_1, u^i_2 + \\hat{h}^i_2, \\cdots, u^i_T + \\hat{h}^i_T)$\n$m^i_1, m^i_2, \\cdots, m^i_T = \\mathrm{FeedForward}(\\hat{u}^i_1, \\hat{u}^i_2, \\cdots, \\hat{u}^i_T)$\n$x^{i+1}_1, x^{i+1}_2, \\cdots, x^{i+1}_T = \\mathrm{LayerNorm}(m^i_1 + \\hat{u}^i_1, m^i_2 + \\hat{u}^i_2, \\cdots, m^i_T + \\hat{u}^i_T)$ (8)\nwhere $T$ is the length of the input sequence and $x^i_t$ is the input position of layer $i$ at time step $t$.\nComparing with TCN: R-Transformer is partly motivated by the hierarchical structure in TCN (Bai et al., 2018); thus, we make a detailed comparison here. In TCN, the locality in sequences is captured by convolution filters. However, the sequential information within each receptive field is ignored by convolution operations. In contrast, the LocalRNN structure in R-Transformer can fully incorporate it by the sequential nature of RNNs. For modeling global long-term dependencies, TCN achieves it with dilated convolutions that operate on nonconsecutive positions. Although such an operation leads to larger receptive fields in lower-level layers, it misses a considerable amount of information from a large portion of positions in each layer. On the other hand, the multi-head attention pooling in R-Transformer considers every past position and takes much more information into consideration than TCN.\nComparing with Transformer: The proposed R-Transformer and standard Transformer enjoy similar long-term memorization capacities thanks to the multi-head attention mechanism (Vaswani et al., 2017). 
Nevertheless, two important features distinguish R-Transformer from the standard Transformer. First, R-Transformer explicitly and effectively captures the locality in sequences with the novel LocalRNN structure while standard Transformer models it very vaguely with multi-head attention that operates on all of the positions. Second, R-Transformer does not rely on any position embeddings as Transformer does. In fact, the benefits of simple position embeddings are very\nlimited (Al-Rfou et al., 2018) and it requires considerable amount of efforts to design effective position embeddings as well as proper ways to incorporate them (Dai et al., 2019). In the next section, we will empirically demonstrate the advantages of R-Transformer over both TCN and the standard Transformer. 4 EXPERIMENT\nSince the R-Transformer is a general sequential learning framework, we evaluate it with sequential data from various domains including images, audios and natural languages. We mainly compare it with canonical recurrent architectures (Vanilla RNN, GRU, LSTM) and two of the most popular generic sequence models that do not have any recurrent structures, namely, TCN and Transformer. However, since the majority of existing efforts to enhance Transformer are for natural languages, in the natural language evaluation, we also include one recent advanced Transformer, i.e., TransformerXL. For all the tasks, Transformer and R-Transformer were implemented with Pytorch and the results for canonical recurrent architectures and TCN were directly copied from Bai et al. (2018) as we follow the same experimental settings. In addition, to make the comparison fair, we use the same set of hyperparameters (i.e, hidden size, number of layers, number of heads) for R-Transformer and Transformer. Moreover, unless specified otherwise, for training, all models are trained with same optimizer and learning rate is chosen from the same set of values according to validation performance. 
In addition, the learning rate is annealed such that it is reduced when validation performance reaches a plateau.\n4.1 PIXEL-BY-PIXEL MNIST: SEQUENCE CLASSIFICATION\nThis task is designed to test a model\u2019s ability to memorize long-term dependencies. It was first proposed by Le et al. (2015) and has been used by many previous works (Wisdom et al., 2016; Chang et al., 2017; Zhang et al., 2016; Krueger et al., 2016). Following previous settings, we reshape each 28 \u00d7 28 image in the MNIST dataset (LeCun et al., 1998) into a 784 \u00d7 1 sequence, which will be classified into ten categories (each image corresponds to one of the digits from 0 to 9) by the sequence models. Since the reshaping could make pixels that are connected in the original images far apart from each other, it requires the sequence models to learn very long-term dependencies to understand the content of each sequence. The dataset is split into training and testing sets in the same way as the default ones in Pytorch (version 1.0.0) 2. The model hyperparameters and classification accuracy are reported in Table 1. From the table, it can be observed that, firstly, RNN-based methods generally perform worse than others. This is because the input sequences exhibit very long-term dependencies and it is extremely difficult for RNNs to memorize them. On the other hand, methods that build direct connections among positions, i.e., Transformer and TCN, achieve much better results. It is also interesting to see that TCN is slightly better than Transformer; we argue that this is because the standard Transformer cannot model the locality very well. However, our proposed R-Transformer, which leverages LocalRNN to incorporate local information, has achieved better performance than TCN.\n2https://pytorch.org\n4.2 NOTTINGHAM: POLYPHONIC MUSIC MODELING\nNext, we evaluate R-Transformer on the task of polyphonic music modeling with the Nottingham dataset (Boulanger-Lewandowski et al., 2012). 
This dataset collects British and American folk tunes and has been commonly used in previous works to investigate the model\u2019s ability for polyphonic music modeling (Boulanger-Lewandowski et al., 2012; Chung et al., 2014; Bai et al., 2018). Following the same setting as Bai et al. (2018), we split the data into training, validation, and testing sets, which contain 694, 173 and 170 tunes, respectively. The learning rate is chosen from {5e\u22124, 5e\u22125, 5e\u22126} and dropout with probability of 0.1 is used to avoid overfitting. Moreover, gradient clipping is used during the training process. We choose negative log-likelihood (NLL) as the evaluation metric, and a lower value indicates better performance. The experimental results are shown in Table 2. Both LSTM and TCN outperform Transformer in this task. We suspect this is because these music tunes exhibit strong local structures. While Transformer is equipped with a multi-head attention mechanism that is effective at capturing long-term dependencies, it fails to capture local structures in sequences that could provide strong signals. On the other hand, R-Transformer enhanced by LocalRNN has achieved much better results than Transformer. In addition, it also outperforms TCN by a large margin. This is expected because TCN tends to ignore the sequential information in the local structure, which can play an important role as suggested by Gehring et al. (2017). 4.3 PENNTREEBANK: LANGUAGE MODELING\nIn this subsection, we further evaluate R-Transformer\u2019s ability on both character-level and word-level language modeling tasks. The dataset we use is PennTreebank (PTB) (Marcus et al., 1993), which contains 1 million words and has been extensively used by previous works to investigate sequence models (Chen & Goodman, 1999; Chelba & Jelinek, 2000; Kim et al., 2016; Tran et al., 2016). For the character-level language modeling task, the model is required to predict the next character given a context. 
Following the experimental settings in Bai et al. (2018), we split the dataset into training, validation and testing sets that contain 5059K, 396K and 446K characters, respectively. For Transformer and R-Transformer, the learning rate is chosen from {1, 2, 3} and the dropout rate is 0.15. Gradient clipping is also used during the training process. Bits per character (bpc) is used to measure the prediction performance.\nFor word-level language modeling, the models are required to predict the next word given the contextual words. Similarly, we follow previous works and split PTB into training, validation, and testing sets with 888K, 70K and 79K words, respectively. The vocabulary size of PTB is 10K. As with character-level language modeling, the learning rate is chosen from {1, 2, 3} for Transformer and R-Transformer and the dropout rate is 0.35. In this task, we also add Transformer-XL (Dai et al., 2019) as one baseline, which has been particularly designed for language modeling tasks and has achieved state-of-the-art performance. Note that to make the comparison fair, we apply the same model configuration, i.e., number of layers, to Transformer-XL. All other settings such as optimizer are the same as its original ones. The learning rate is chosen from {0.01, 0.001, 0.0001} and its best validation performance is achieved with 0.001. Note that, except dropout, no other regularization tricks such as variational dropout and weight dropout are applied. The prediction performance is evaluated with perplexity, the lower value of which denotes better performance.\nThe experimental results of character-level and word-level language modeling tasks are shown in Table 3 and Table 4, respectively. Several observations can be made from Table 3. First, Transformer performs only slightly better than RNNs while much worse than other models. 
The reason for this observation is similar to the case of the polyphonic music modeling task: language exhibits strong local structures and the standard Transformer cannot fully capture them. Second, TCN achieves better results than all of the RNNs, which is attributed to its ability to capture both local structures and long-term dependencies in languages. Notably, for both local structures and long-term dependencies, R-Transformer has more powerful components than TCN, i.e., LocalRNN and multi-head attention. Therefore, it is not surprising to see that R-Transformer achieves significantly better results. Table 4 presents the results for word-level language modeling. Similar trends are observed, with the only exception that LSTM achieves the best results among all the methods. In addition, the result of Transformer-XL is only slightly better than that of R-Transformer. Considering the fact that Transformer-XL is specifically designed for language modeling and employs the recurrent connection of segments (Dai et al., 2019), this result suggests the limited contribution of engineered positional embeddings. 4.4 DISCUSSIONS AND EVALUATION LIMITATIONS\nIn summary, experimental results have shown that the standard Transformer can achieve better results than RNNs when sequences exhibit very long-term dependencies, i.e., sequential MNIST, while its performance can drop dramatically when strong locality exists in sequences, i.e., polyphonic music and language. Meanwhile, TCN is a very strong sequence model that can effectively learn both local structures and long-term dependencies and has very stable performance in different tasks. More importantly, the proposed R-Transformer, which combines a lower-level LocalRNN and a higher-level multi-head attention, outperforms both TCN and Transformer by a large margin consistently in most of the tasks. The experiments are conducted on various sequential learning tasks with datasets from different domains. 
Moreover, all experimental settings are fair to all baselines. Thus, the observations from the experiments are reliable with the current experimental settings. However, due to computational limitations, we currently restrict our evaluation settings to moderate model and dataset sizes. Thus, more evaluations on big models and large datasets can make the results more convincing. We would like to leave this as future work. 6 CONCLUSION\nIn this paper, we propose a novel generic sequence model that enjoys the advantages of both RNN and the multi-head attention while mitigating their disadvantages. Specifically, it consists of a LocalRNN that learns the local structures without suffering from any of the weaknesses of RNN and a multi-head attention pooling that effectively captures long-term dependencies without any help of position embeddings. In addition, the model can be easily implemented with full parallelization over the positions in a sequence. The empirical results on sequence modeling tasks from a wide range of domains have demonstrated the remarkable advantages of R-Transformer over state-of-the-art non-recurrent sequence models such as TCN and standard Transformer as well as canonical recurrent architectures.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Learning to search better than your teacher , Authors: Kai-Wei Chang Akshay Krishnamurthy... , 2015\n2.  Title: Hierarchical imitation and reinforcement learning , Authors: Hoang M Le Nan Jiang... , 2018\n3.  Title: Learning self-correctable policies and value functions from demonstrations with negative sampling , Authors: Yuping Luo Huazhe Xu... , 2019\n4.  Title: Overcoming exploration in reinforcement learning with demonstrations , Authors: Ashvin Nair Bob McGrew... , 2018\n5.  Title: Alvinn: An autonomous land vehicle in a neural network , Authors: Dean A. Pomerleau , 1989\n6.  Title: SQIL: imitation learning via regularized behavioral cloning , Authors: Siddharth Reddy Anca D. Dragan... , 2019\n7.  Title: A reduction of imitation learning and structured prediction to no-regret online learning , Authors: Stephane Ross Geoffrey Gordon... , 2011\n8.  Title: Deeply aggrevated: Differentiable imitation learning for sequential prediction , Authors: Wen Sun Arun Venkatraman... , 2017\n9.  Title: Improving multi-step prediction of learned time series models , Authors: Arun Venkatraman Martial Hebert... , 2015\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["We present a simple and effective algorithm designed to address the covariate shift problem in imitation learning. It operates by training an ensemble of policies on the expert demonstration data, and using the variance of their predictions as a cost which is minimized with RL together with a supervised behavioral cloning cost. Unlike adversarial imitation methods, it uses a fixed reward function which is easy to optimize. 
We prove a regret bound for the algorithm which is linear in the time horizon multiplied by a coefficient which we show to be low for certain problems on which behavioral cloning fails. We evaluate our algorithm empirically across multiple pixel-based Atari environments and continuous control tasks, and show that it matches or significantly outperforms behavioral cloning and generative adversarial imitation learning.", "The idea of learning through imitation dates back at least to the work of (Pomerleau, 1989), who trained a neural network to imitate the steering actions of a human driver using images as input. The problem of covariate shift was already observed, as the author notes: \u201cthe network must not solely be shown examples of accurate driving, but also how to recover once a mistake has been made\u201d. This issue was formalized in the work of (Ross & Bagnell, 2010), who on one hand proved an O(\u03b5T^2) regret bound, and on the other hand provided an example showing this bound is tight. The subsequent work (Ross et al., 2011) proposed the DAGGER algorithm which obtains linear regret, provided the agent can both interact with the environment and query the expert policy. Our approach also requires environment interaction, but importantly does not need to query the expert. Also of note is the work of (Venkatraman et al., 2015), which extended DAGGER to time series prediction problems by using the true targets as expert corrections. Imitation learning has been used within the context of modern RL to help improve sample efficiency (Chang et al., 2015; Ross & Bagnell, 2014; Sun et al., 2017; Hester et al., 2018; Le et al., 2018; Cheng & Boots, 2018) or overcome exploration (Nair et al., 2017). These settings assume the reward is known and that the policies can then be fine-tuned with reinforcement learning. In this case, covariate shift is less of an issue since it can be corrected using the reinforcement signal. 
The work of (Luo et al., 2019) also proposed a method to address the covariate shift problem when learning from demonstrations when the reward is known, by conservatively extrapolating the value function outside the training distribution using negative sampling. This addresses a different setting from ours, and requires generating plausible states which are off the manifold of training data, which may be challenging when the states are high dimensional such as images. The work of (Reddy et al., 2019) proposed to treat imitation learning within the Q-learning framework, setting a positive reward for all transitions inside the demonstration data and zero reward for all other transitions in the replay buffer. This rewards the agent for repeating (or returning to) the expert\u2019s transitions.", "The paper proposes a simple but effective method to address the important issue of covariate shift. The method performs well empirically and has theoretical support (although only for a tabular setting). While there are some issues (see below), this is a good paper. I vote for acceptance.  \n\n* Major comments and questions:\n- Accuracy of posterior approximation via ensemble. \nIt is unclear whether the posterior approximated from the ensemble is accurate. More specifically, these ensemble policies are trained using the BC loss. Under a limited amount of data (where BC fails), these policies would also fail and be inaccurate. Therefore, it should not be expected that a posterior from these inaccurate policies is accurate. Have the authors measured or analyzed the accuracy of these policies or that of the posterior? This important point is not mentioned or analyzed in the paper.\n\n- Alternative approaches to posterior approximation and uncertainty computation. \nThere are other approaches to obtain a posterior besides the ensemble approach, e.g., Bayesian neural networks. Such alternatives were not mentioned in the paper. 
Also, there are other quantities for measuring uncertainty besides the variance, such as the entropy. These approaches and quantities have different pros and cons and they should be discussed in the paper.\n\n- Sample complexity in terms of environment interactions. \nThe sample complexity in terms of environment interactions is an important criterion for IL. I suggest that the authors include this criterion in the experiments. \n\n* Minor questions:\n- Why is the minibatch size only 4 in the experiments for all methods? This is clearly too small for reasonable training of deep networks. Is this a typo?\n\n- It is strange to not evaluate GAIL in the continuous control experiments, since GAIL was originally evaluated in these domains. I strongly suggest that the authors evaluate GAIL (and perhaps stronger methods such as VAIL (Peng et al., 2019)) in the continuous control experiments.\n\n8: Accept"], "input": "1 INTRODUCTION\nTraining artificial agents to perform complex tasks is essential for many applications in robotics, video games and dialogue. If success on the task can be accurately described using a reward or cost function, reinforcement learning (RL) methods offer an approach to learning policies which has proven to be successful in a wide variety of applications (Mnih et al., 2015; 2016; Lillicrap et al., 2016; Hessel et al., 2018). However, in other cases the desired behavior may only be roughly specified and it is unclear how to design a reward function to characterize it. For example, training a video game agent to adopt more human-like behavior using RL would require designing a reward function which characterizes behaviors as more or less human-like, which is difficult.\nImitation learning (IL) offers an elegant approach whereby agents are trained to mimic the demonstrations of an expert rather than optimizing a reward function. 
Its simplest form consists of training a policy to predict the expert\u2019s actions from states in the demonstration data using supervised learning. While appealingly simple, this approach suffers from the fact that the distribution over states observed at execution time can differ from the distribution observed during training. Minor errors which initially produce small deviations become magnified as the policy encounters states further and further from its training distribution. This phenomenon, initially noted in the early work of (Pomerleau, 1989), was formalized in the work of (Ross & Bagnell, 2010) who proved a quadratic O(\u03b5T^2) bound on the regret and showed that this bound is tight. The subsequent work of (Ross et al., 2011) showed that if the policy is allowed to further interact with the environment and make queries to the expert policy, it is possible to obtain a linear bound on the regret. However, the ability to query an expert can often be a strong assumption.\nIn this work, we propose a new and simple algorithm called DRIL (Disagreement-Regularized Imitation Learning) to address the covariate shift problem in imitation learning, in the setting where the agent is allowed to interact with its environment. Importantly, the algorithm does not require any additional interaction with the expert. It operates by training an ensemble of policies on the demonstration data, and using the disagreement in their predictions as a cost which is optimized through RL together with a supervised behavioral cloning cost. The motivation is that the policies in the ensemble will tend to agree on the set of states covered by the expert, leading to low cost, but are more likely to disagree on states not covered by the expert, leading to high cost. The RL cost
The RL cost\n\u2217Work done while at Microsoft Research.\nthus guides the agent back towards the distribution of the expert, while the supervised cost ensures that it mimics the expert within the expert\u2019s distribution.\nOur theoretical results show that, subject to realizability and optimization oracle assumptions1, our algorithm obtains aO( \u03baT ) regret bound, where \u03ba is a measure which quantifies a tradeoff between the concentration of the demonstration data and the diversity of the ensemble outside the demonstration data. We evaluate DRIL empirically across multiple pixel-based Atari environments and continuous control tasks, and show that it matches or significantly outperforms behavioral cloning and generative adversarial imitation learning, often recovering expert performance with only a few trajectories. 2 PRELIMINARIES\nWe consider episodic finite horizon MDP in this work. Denote by S the state space, A the action space, and \u03a0 the class of policies the learner is considering. Let T denote the task horizon and \u03c0? the expert policy whose behavior the learner is trying to mimic. For any policy \u03c0, let d\u03c0 denote the distribution over states induced by following \u03c0. Denote C(s, a) the expected immediate cost of performing action a in state s, which we assume is bounded in [0, 1]. In the imitation learning setting, we do not necessarily know the true costs C(s, a), and instead we observe expert demonstrations. Our goal is to find a policy \u03c0 which minimizes an observed surrogate loss ` between its actions and the actions of the expert under its induced distribution of states, i.e.\n\u03c0\u0302 = arg minEs\u223cd\u03c0 [`(\u03c0(s), \u03c0?(s))] (1)\nFor the following, we will assume ` is the total variation distance (denoted by \u2016 \u00b7 \u2016), which is an upper bound on the 0\u22121 loss. 
Our goal is thus to minimize the following quantity, which represents the distance between the actions taken by our policy \u03c0 and the expert policy \u03c0*:\nJ_exp(\u03c0) = E_{s\u223cd_\u03c0}[ \u2016\u03c0(\u00b7|s) \u2212 \u03c0*(\u00b7|s)\u2016 ] (2)\nDenote by Q^\u03c0_t(s, a) the standard Q-function of the policy \u03c0, which is defined as Q^\u03c0_t(s, a) = E[ \u2211_{\u03c4=t}^{T} C(s_\u03c4, a_\u03c4) | (s_t, a_t) = (s, a), a_\u03c4 \u223c \u03c0 ]. The following result shows that if \u2113 is an upper bound on the 0\u22121 loss and C satisfies certain smoothness conditions, then minimizing this loss within \u03b5 translates into an O(\u03b5T) regret bound on the true task cost J_C(\u03c0) = E_{s,a\u223cd_\u03c0}[C(s, a)]: Theorem 1. (Ross et al., 2011) If \u03c0 satisfies J_exp(\u03c0) = \u03b5, and Q^{\u03c0*}_{T\u2212t+1}(s, a) \u2212 Q^{\u03c0*}_{T\u2212t+1}(s, \u03c0*) \u2264 u for all time steps t, actions a and states s reachable by \u03c0, then J_C(\u03c0) \u2264 J_C(\u03c0*) + uT\u03b5.\nUnfortunately, it is often not possible to optimize J_exp directly, since it requires evaluating the expert policy on the states induced by following the current policy. The supervised behavioral cloning cost J_BC, which is computed on states induced by the expert, is often used instead:\nJ_BC(\u03c0) = E_{s\u223cd_{\u03c0*}}[ \u2016\u03c0*(\u00b7|s) \u2212 \u03c0(\u00b7|s)\u2016 ] (3)\nMinimizing this loss within \u03b5 yields a quadratic bound on regret: Theorem 2. (Ross & Bagnell, 2010) Let J_BC(\u03c0) = \u03b5, then J_C(\u03c0) \u2264 J_C(\u03c0*) + T^2\u03b5.\nFurthermore, this bound is tight: as we will discuss later, there exist simple problems which match the worst-case lower bound. 
3 ALGORITHM\nOur algorithm is motivated by two criteria: i) the policy should act similarly to the expert within the expert\u2019s data distribution, and ii) the policy should move towards the expert\u2019s data distribution if it is outside of it.\n1We assume for the analysis the action space is discrete, but the state space can be large or infinite.\nAlgorithm 1 Disagreement-Regularized Imitation Learning (DRIL)\n1: Input: Expert demonstration data D = {(s_i, a_i)}_{i=1}^{N}\n2: Initialize policy \u03c0 and policy ensemble \u03a0_E = {\u03c0_1, ..., \u03c0_E}\n3: for e = 1, E do\n4: Sample D_e \u223c D with replacement, with |D_e| = |D|.\n5: Train \u03c0_e to minimize J_BC(\u03c0_e) on D_e to convergence.\n6: end for\n7: for i = 1, ... do\n8: Perform one gradient update to minimize J_BC(\u03c0) using a minibatch from D.\n9: Perform one step of policy gradient to minimize E_{s\u223cd_\u03c0, a\u223c\u03c0(\u00b7|s)}[C_U^clip(s, a)].\n10: end for\nThese two criteria are addressed by combining two losses: a standard behavior cloning loss, and an additional loss which represents the variance over the outputs of an ensemble \u03a0_E = {\u03c0_1, ..., \u03c0_E} of policies trained on the demonstration data D. We call this the uncertainty cost, which is defined as:\nC_U(s, a) = Var_{\u03c0\u223c\u03a0_E}(\u03c0(a|s)) = (1/E) \u2211_{i=1}^{E} ( \u03c0_i(a|s) \u2212 (1/E) \u2211_{j=1}^{E} \u03c0_j(a|s) )^2\nThe motivation is that the variance over plausible policies is high outside the expert\u2019s distribution, since the data is sparse, but it is low inside the expert\u2019s distribution, since the data there is dense. Minimizing this cost encourages the policy to return to regions of dense coverage by the expert. Intuitively, this is what we would expect the expert policy \u03c0* to do as well. The total cost which the algorithm optimizes is given by:
\nJ_alg(\u03c0) = E_{s\u223cd_{\u03c0*}}[ \u2016\u03c0*(\u00b7|s) \u2212 \u03c0(\u00b7|s)\u2016 ] + E_{s\u223cd_\u03c0, a\u223c\u03c0(\u00b7|s)}[C_U(s, a)], where the first expectation is J_BC(\u03c0) and the second is J_U(\u03c0).\nThe first term is a behavior cloning loss and is computed over states generated by the expert policy, of which the demonstration data D is a representative sample. The second term is computed over the distribution of states generated by the current policy and can be optimized using policy gradient.\nNote that the demonstration data is fixed, and the ensemble can be trained once offline. We then interleave the supervised behavioral cloning updates and the policy gradient updates which minimize the variance of the ensemble. The full algorithm is shown in Algorithm 1. We also found that dropout (Srivastava et al., 2014), which has been proposed as an approximate form of ensembling, worked well (see Appendix D).\nIn practice, for the supervised loss we optimize the KL divergence between the actions predicted by the policy and the expert actions, which is an upper bound on the total variation distance due to Pinsker\u2019s inequality. We also found it helpful to use a clipped uncertainty cost:\nC_U^clip(s, a) = \u22121 if C_U(s, a) \u2264 q, and +1 otherwise\nwhere the threshold q is a top quantile of the raw uncertainty costs computed over the demonstration data. The threshold q defines a normal range of uncertainty based on the demonstration data, and values above this range incur a positive cost (or negative reward).\nThe RL cost can be optimized using any policy gradient method. In our experiments we used advantage actor-critic (A2C) (Mnih et al., 2016) or PPO (Schulman et al., 2017), which estimate the expected cost using rollouts from multiple parallel actors all sharing the same policy (see Appendix C for details). We note that model-based RL methods could in principle be used as well if sample efficiency is a constraint. 
4 ANALYSIS\n 4.1 COVERAGE COEFFICIENT\nWe now analyze DRIL for MDPs with discrete action spaces and potentially large or infinite state spaces. We will show that, subject to assumptions that the policy class contains an optimal policy and that we are able to optimize costs within \u03b5 of their global minimum, our algorithm obtains a regret bound which is linear in \u03baT, where \u03ba is a quantity which depends on the environment dynamics, the expert distribution d_{\u03c0*}, and our learned ensemble. Intuitively, \u03ba represents a tradeoff between how concentrated the demonstration data is and how high the variance of the ensemble is outside the expert distribution. Assumption 1. (Realizability) \u03c0* \u2208 \u03a0. Assumption 2. (Optimization Oracle) For any given cost function J, our minimization procedure returns a policy \u03c0\u0302 \u2208 \u03a0 such that J(\u03c0\u0302) \u2264 min_{\u03c0\u2208\u03a0} J(\u03c0) + \u03b5.\nThe motivation behind our algorithm is that the policies in the ensemble agree inside the expert\u2019s distribution and disagree outside of it. This defines a reward function which pushes the learner back towards the expert\u2019s distribution if it strays away. However, what constitutes inside and outside the distribution, or sufficient agreement or disagreement, is ambiguous. Below we introduce quantities which make these ideas precise. Definition 1. For any set U \u2286 S, define the concentrability inside of U as \u03b1(U) = max_{\u03c0\u2208\u03a0} sup_{s\u2208U} d_\u03c0(s) / d_{\u03c0*}(s).\nThe notion of concentrability has been previously used to give bounds on the performance of value iteration (Munos & Szepesva\u0301ri, 2008). For a set U, \u03b1(U) will be low if the expert distribution has high mass at the states in U that are reachable by policies in the policy class. Definition 2. 
Define the minimum variance of the ensemble outside of U as \u03b2(U) = min_{s\u2209U, a\u2208A} Var_{\u03c0\u223c\u03a0_E}[\u03c0(a|s)].\nWe now define the \u03ba coefficient as the minimum ratio of these two quantities over all possible subsets of S. Definition 3. We define \u03ba = min_{U\u2286S} \u03b1(U) / \u03b2(U).\nWe can view \u03ba as the quantity which minimizes the tradeoff over different subsets U between coverage by the expert policy inside of U, and variance of the ensemble outside of U. 4.2 REGRET BOUND\nWe now establish a relationship between the \u03ba coefficient just defined, the cost our algorithm optimizes, and J_exp defined in Equation (2) which we would ideally like to minimize and which translates into a regret bound. All proofs can be found in Appendix A. Lemma 1. For any \u03c0 \u2208 \u03a0, we have J_exp(\u03c0) \u2264 \u03ba J_alg(\u03c0).\nThis result shows that if \u03ba is not too large, and we are able to make our cost function J_alg(\u03c0) small, then we can ensure J_exp(\u03c0) is also small. This result is only useful if our cost function can indeed achieve a small minimum. The next lemma shows that this is the case. Lemma 2. min_{\u03c0\u2208\u03a0} J_alg(\u03c0) \u2264 2\u03b5.\nHere \u03b5 is the threshold specified in Assumption 2. Combining these two lemmas with the previous result of Ross et al. (2011), we get a regret bound which is linear in \u03baT. Theorem 3. Let \u03c0\u0302 be the result of minimizing J_alg using our optimization oracle, and assume that Q^{\u03c0*}_{T\u2212t+1}(s, a) \u2212 Q^{\u03c0*}_{T\u2212t+1}(s, \u03c0*) \u2264 u for all actions a, time steps t and states s reachable by \u03c0. Then \u03c0\u0302 satisfies J_C(\u03c0\u0302) \u2264 J_C(\u03c0*) + 3u\u03ba\u03b5T.\nOur bound is an improvement over that of behavior cloning if \u03ba is less than O(T). Note that DRIL does not require knowledge of \u03ba. 
The quantity \u03ba is problem-dependent and depends on the environment dynamics, the expert policy and the policies in the learned ensemble. We next compute \u03ba exactly for a problem for which behavior cloning is known to perform poorly, and show that it is independent of T. Example 1. Consider the tabular MDP given in (Ross & Bagnell, 2010) as an example of a problem where behavioral cloning incurs quadratic regret, shown in Figure 1. There are 3 states S = (s0, s1, s2) and two actions (a1, a2). Each policy \u03c0 can be represented as a set of probabilities \u03c0(a1|s) for each state s \u2208 S 2. Assume the models in our ensemble are drawn from a posterior p(\u03c0(a1|s)|D) given by a Beta distribution with parameters Beta(n1 + 1, n2 + 1) where n1, n2 are the number of times the pairs (s, a1) and (s, a2) occur, respectively, in the demonstration data D. The agent always starts in s0 and the expert\u2019s policy is given by \u03c0*(a1|s0) = 1, \u03c0*(a1|s1) = 0, \u03c0*(a1|s2) = 1. For any (s, a) pair, the task cost is C(s, a) = 0 if a = \u03c0*(s) and 1 otherwise. Here d_{\u03c0*} = (1/T, (T\u22121)/T, 0). For any \u03c0, d_\u03c0(s0) = 1/T and d_\u03c0(s1) \u2264 (T\u22121)/T due to the dynamics of the MDP, so d_\u03c0(s) / d_{\u03c0*}(s) \u2264 1 for s \u2208 {s0, s1}. Writing out \u03b1({s0, s1}), we get: \u03b1({s0, s1}) = max_{\u03c0\u2208\u03a0} sup_{s\u2208{s0,s1}} d_\u03c0(s) / d_{\u03c0*}(s) \u2264 1.\nFurthermore, since s2 is never visited in the demonstration data, for each policy \u03c0_i in the ensemble we have \u03c0_i(a1|s2), \u03c0_i(a2|s2) \u223c Beta(1, 1) = Uniform(0, 1). It follows that Var_{\u03c0\u223c\u03a0_E}(\u03c0(a|s2)) is approximately equal 3 to the variance of a uniform distribution over [0, 1], i.e. 1/12. Therefore:\n\u03ba = min_{U\u2286S} \u03b1(U) / \u03b2(U) \u2264 \u03b1({s0, s1}) / \u03b2({s0, s1}) \u2272 1 / (1/12) = 12
\nApplying our result from Theorem 3, we see that our algorithm obtains an O(\u03b5T) regret bound on this problem, in contrast to the O(\u03b5T^2) regret of behavioral cloning4. 6 EXPERIMENTS\n 6.1 TABULAR MDPS\nAs a first experiment, we applied DRIL to the tabular MDP of (Ross & Bagnell, 2010) shown in Figure 1. We computed the posterior over the policy parameters given the demonstration data using a separate Beta distribution for each state s with parameters determined by the number of times each action was performed in s. For behavior cloning, we sampled a single policy from this posterior. For DRIL, we sampled an ensemble of 5 policies and used their negative variance to define an additional reward function. We combined this with a reward which was the probability density function of a given state-action pair under the posterior distribution, which corresponds to the supervised learning loss, and used tabular Q-learning to optimize the sum of these two reward functions. This experiment was repeated 500 times for time horizon lengths up to 500 and N = 1, 5, 10 expert demonstration trajectories.\nFigure 2 shows plots of the regret over the 500 different trials across different time horizons. Although BC achieves good average performance, it exhibits poor worst-case performance with some trials incurring very high regret, especially when using fewer demonstrations. Our method has low regret across all trials, which stays close to constant independently of the time horizon, even with a single demonstration. This performance is better than that suggested by our analysis, which showed a worst-case linear bound with respect to time horizon. 6.2 ATARI ENVIRONMENTS\nWe next evaluated our approach on six different Atari environments. We used pretrained PPO (Schulman et al., 2017) agents from the stable baselines repository (Hill et al., 2018) to generate N = {1, 3, 5, 10, 15, 20} expert trajectories. 
We compared against two other methods: standard behavioral cloning (BC) and Generative Adversarial Imitation Learning (GAIL). Results are shown in Figure 3a. DRIL outperforms behavioral cloning across most environments and numbers of demonstrations, often by a substantial margin. In many cases, our method is able to match the expert\u2019s performance using a small number of trajectories. Figure 3b shows the evolution of the uncertainty cost and the policy reward throughout training. In all cases, the reward improves while the uncertainty cost decreases.\nWe were not able to obtain meaningful performance for GAIL on these domains, despite performing a hyperparameter search across learning rates for the policy and discriminator, and across different numbers of discriminator updates. We additionally experimented with clipping rewards in an effort to stabilize performance. These results are consistent with those of (Reddy et al., 2019), who also reported negative results when running GAIL on images. While improved performance might be possible with more sophisticated adversarial training techniques, we note that this contrasts with our method which uses a fixed reward function obtained through simple supervised learning.\nIn Appendix D we provide ablation experiments examining the effects of the cost function clipping and the role of the BC loss. We also compare the ensemble approach to a dropout-based approximation and show that DRIL works well in both cases. 6.3 CONTINUOUS CONTROL\nWe next report results of running our method on 6 different continuous control tasks from the PyBullet5 and OpenAI Gym (Brockman et al., 2016) environments. We again used pretrained agents to generate expert demonstrations, and compared to Behavior Cloning and GAIL.\nResults for all methods are shown in Figure 4. 
In these environments we found Behavior Cloning to be a much stronger baseline than for the Atari environments: in several tasks it was able to match expert performance using as few as 3 trajectories, suggesting that covariate shift may be less of an issue. Our method performs similarly to Behavior Cloning on most tasks, except on Walker2D, where it yields improved performance for N = 1, 3, 5 trajectories. GAIL performs somewhat better than DRIL on HalfCheetah and Walker2D, but performs worse than both DRIL and BC on LunarLander and BipedalWalkerHardcore. The fact that DRIL is competitive across all tasks provides evidence of its robustness.\n5 https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs/examples\n7 CONCLUSION\nAddressing covariate shift has been a long-standing challenge in imitation learning. In this work, we have proposed a new method to address this problem by penalizing the disagreement between an ensemble of different policies trained on the demonstration data. Importantly, our method requires no additional labeling by an expert. Our experimental results demonstrate that DRIL can often match expert performance while using only a small number of trajectories across a wide array of tasks, ranging from tabular MDPs to pixel-based Atari games and continuous control tasks. On the theoretical side, we have shown that our algorithm can provably obtain a low regret bound for problems in which the \u03ba parameter is low.\nThere are multiple directions for future work. On the theoretical side, characterizing the \u03ba parameter on a larger array of problems would help to better understand the settings where our method can expect to do well. Empirically, there are many other settings in structured prediction (Daume\u0301 et al., 2009) where covariate shift is an issue and where our method could be applied.
For example, in dialogue and language modeling it is common for generated text to become progressively less coherent as errors push the model off the manifold it was trained on. Our method could potentially be used to fine-tune language or translation models (Cho et al., 2014; Welleck et al., 2019) after training by applying our uncertainty-based cost function to the generated text.\nA PROOFS\nLemma 1. For any $\\pi \\in \\Pi$ we have $J_{exp}(\\pi) \\le \\kappa J_{alg}(\\pi)$.\nProof. We will first show that for any $\\pi \\in \\Pi$ and $U \\subseteq S$, we have $J_{exp}(\\pi) \\le \\frac{\\alpha(U)}{\\beta(U)} J_{alg}(\\pi)$. We can rewrite $J_{exp}(\\pi)$ as:\n$J_{exp}(\\pi) = \\mathbb{E}_{s \\sim d_\\pi}[\\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|]$\n$= \\mathbb{E}_{s \\sim d_\\pi}[\\mathbb{I}(s \\in U) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|] + \\mathbb{E}_{s \\sim d_\\pi}[\\mathbb{I}(s \\notin U) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|]$\nWe begin by bounding the first term:\n$\\mathbb{E}_{s \\sim d_\\pi}[\\mathbb{I}(s \\in U) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|] = \\sum_{s \\in U} d_\\pi(s) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|$\n$= \\sum_{s \\in U} \\frac{d_\\pi(s)}{d_{\\pi^\\star}(s)} d_{\\pi^\\star}(s) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|$\n$\\le \\sum_{s \\in U} \\underbrace{\\left( \\max_{\\pi' \\in \\Pi} \\sup_{s \\in U} \\frac{d_{\\pi'}(s)}{d_{\\pi^\\star}(s)} \\right)}_{\\alpha(U)} d_{\\pi^\\star}(s) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|$\n$= \\alpha(U) \\sum_{s \\in U} d_{\\pi^\\star}(s) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|$\n$\\le \\alpha(U) \\sum_{s \\in S} d_{\\pi^\\star}(s) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|$\n$= \\alpha(U) \\mathbb{E}_{s \\sim d_{\\pi^\\star}}[\\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|] = \\alpha(U) J_{BC}(\\pi)$\n
We next bound the second term:\n$\\mathbb{E}_{s \\sim d_\\pi}[\\mathbb{I}(s \\notin U) \\|\\pi(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|] \\le \\mathbb{E}_{s \\sim d_\\pi}[\\mathbb{I}(s \\notin U)]$\n$\\le \\mathbb{E}_{s \\sim d_\\pi}\\left[ \\mathbb{I}(s \\notin U) \\frac{\\min_{a \\in A} \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]}{\\beta(U)} \\right]$\n$\\le \\frac{1}{\\beta(U)} \\mathbb{E}_{s \\sim d_\\pi}\\left[ \\mathbb{I}(s \\notin U) \\sum_{a \\in A} \\pi(a|s) \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)] \\right]$\n$= \\frac{1}{\\beta(U)} \\underbrace{\\sum_{s \\notin U} d_\\pi(s) \\sum_{a \\in A} \\pi(a|s) \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]}_{A(\\pi)}$\nNow observe we can decompose the RL cost as follows:\n$J_U(\\pi) = \\mathbb{E}_{s \\sim d_\\pi, a \\sim \\pi(\\cdot|s)}[\\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]] = \\sum_{s} d_\\pi(s) \\sum_{a} \\pi(a|s) \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]$\n$= \\underbrace{\\sum_{s \\in U} d_\\pi(s) \\sum_{a} \\pi(a|s) \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]}_{B(\\pi)} + \\underbrace{\\sum_{s \\notin U} d_\\pi(s) \\sum_{a} \\pi(a|s) \\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]}_{A(\\pi)}$\nPutting these together, we get the following:\n$J_{exp}(\\pi) \\le \\alpha(U) J_{BC}(\\pi) + \\frac{1}{\\beta(U)} A(\\pi)$\n$= \\frac{\\alpha(U) \\beta(U)}{\\beta(U)} J_{BC}(\\pi) + \\frac{\\alpha(U)}{\\alpha(U) \\beta(U)} A(\\pi)$\n$\\le \\frac{\\alpha(U)}{\\beta(U)} J_{BC}(\\pi) + \\frac{\\alpha(U)}{\\beta(U)} A(\\pi)$\n$\\le \\frac{\\alpha(U)}{\\beta(U)} \\left( J_{BC}(\\pi) + A(\\pi) \\right)$\n$\\le \\frac{\\alpha(U)}{\\beta(U)} \\left( J_{BC}(\\pi) + J_U(\\pi) \\right)$\n$= \\frac{\\alpha(U)}{\\beta(U)} J_{alg}(\\pi)$\nHere we have used the fact that $\\beta(U) \\le 1$ since $0 \\le \\pi(a|s) \\le 1$, and $\\alpha(U) \\ge \\sup_{s \\in U} \\frac{d_{\\pi^\\star}(s)}{d_{\\pi^\\star}(s)} = 1$, hence $\\frac{1}{\\alpha(U)} \\le 1$. Taking the minimum over subsets $U \\subseteq S$, we get $J_{exp}(\\pi) \\le \\kappa J_{alg}(\\pi)$.\n
Lemma 2. $\\min_{\\pi \\in \\Pi} J_{alg}(\\pi) \\le 2\\epsilon$\nProof. Plugging the optimal policy into $J_{alg}$, we get:\n$J_{alg}(\\pi^\\star) = J_{BC}(\\pi^\\star) + J_U(\\pi^\\star) = 0 + \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}[\\mathrm{Var}_{\\pi_i \\sim \\Pi_E}[\\pi_i(a|s)]]$\n$= \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\frac{1}{E} \\sum_{i=1}^{E} (\\pi_i(a|s) - \\bar{\\pi}(a|s))^2 \\right]$\n$\\le \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\frac{1}{E} \\sum_{i=1}^{E} (\\pi_i(a|s) - \\pi^\\star(a|s))^2 + (\\bar{\\pi}(a|s) - \\pi^\\star(a|s))^2 \\right]$\n$= \\underbrace{\\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\frac{1}{E} \\sum_{i=1}^{E} (\\pi_i(a|s) - \\pi^\\star(a|s))^2 \\right]}_{\\text{Term 1}} + \\underbrace{\\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ (\\bar{\\pi}(a|s) - \\pi^\\star(a|s))^2 \\right]}_{\\text{Term 2}}$\nWe will first bound Term 1:\n$\\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\frac{1}{E} \\sum_{i=1}^{E} (\\pi_i(a|s) - \\pi^\\star(a|s))^2 \\right] = \\frac{1}{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}}\\left[ \\sum_{a \\in A} \\pi^\\star(a|s) \\sum_{i=1}^{E} (\\pi_i(a|s) - \\pi^\\star(a|s))^2 \\right]$\n$\\le \\frac{1}{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}}\\left[ \\sum_{a \\in A} \\pi^\\star(a|s) \\sum_{i=1}^{E} |\\pi_i(a|s) - \\pi^\\star(a|s)| \\right]$\n$\\le \\frac{1}{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}}\\left[ \\sum_{i=1}^{E} \\sum_{a \\in A} |\\pi_i(a|s) - \\pi^\\star(a|s)| \\right]$\n$\\le \\frac{1}{E} \\sum_{i=1}^{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}}[\\|\\pi_i(\\cdot|s) - \\pi^\\star(\\cdot|s)\\|]$\n$\\le \\frac{1}{E} \\sum_{i=1}^{E} \\epsilon$\n$= \\epsilon$\n
We will next bound Term 2:\n$\\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ (\\bar{\\pi}(a|s) - \\pi^\\star(a|s))^2 \\right] = \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\left( \\pi^\\star(a|s) - \\frac{1}{E} \\sum_{i=1}^{E} \\pi_i(a|s) \\right)^2 \\right]$\n$= \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\left( \\frac{1}{E} \\sum_{i=1}^{E} \\pi^\\star(a|s) - \\frac{1}{E} \\sum_{i=1}^{E} \\pi_i(a|s) \\right)^2 \\right]$\n$= \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\left( \\frac{1}{E} \\sum_{i=1}^{E} (\\pi^\\star(a|s) - \\pi_i(a|s)) \\right)^2 \\right]$\n$\\le \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ \\frac{1}{E^2} E \\sum_{i=1}^{E} (\\pi^\\star(a|s) - \\pi_i(a|s))^2 \\right]$ (Cauchy-Schwarz)\n$= \\frac{1}{E} \\sum_{i=1}^{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ (\\pi^\\star(a|s) - \\pi_i(a|s))^2 \\right]$\n$\\le \\frac{1}{E} \\sum_{i=1}^{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}, a \\sim \\pi^\\star(\\cdot|s)}\\left[ |\\pi^\\star(a|s) - \\pi_i(a|s)| \\right]$\n$\\le \\frac{1}{E} \\sum_{i=1}^{E} \\mathbb{E}_{s \\sim d_{\\pi^\\star}}[\\|\\pi^\\star(\\cdot|s) - \\pi_i(\\cdot|s)\\|]$\n$= \\frac{1}{E} \\sum_{i=1}^{E} J_{BC}(\\pi_i)$\n$\\le \\epsilon$\nThe last step follows from our optimization oracle assumption: $0 \\le \\min_{\\pi \\in \\Pi} J_{BC}(\\pi) \\le J_{BC}(\\pi^\\star) = 0$, hence $J_{BC}(\\pi_i) \\le 0 + \\epsilon = \\epsilon$. Combining the bounds on the two terms, we get $J_{alg}(\\pi^\\star) \\le 2\\epsilon$. Since $\\pi^\\star \\in \\Pi$, the result follows.\n
Theorem 1. Let $\\hat{\\pi}$ be the result of minimizing $J_{alg}$ using our optimization oracle, and assume that $Q^{\\pi^\\star}_{T-t+1}(s, a) - Q^{\\pi^\\star}_{T-t+1}(s, \\pi^\\star) \\le u$ for all $a \\in A$, $t \\in \\{1, 2, \\ldots, T\\}$, $d^t_\\pi(s) > 0$. Then $\\hat{\\pi}$ satisfies\n$J(\\hat{\\pi}) \\le J(\\pi^\\star) + 3u\\kappa\\epsilon T$.\nProof. By our optimization oracle and Lemma 2, we have\n$J_{alg}(\\hat{\\pi}) \\le \\min_{\\pi \\in \\Pi} J_{alg}(\\pi) + \\epsilon \\le 2\\epsilon + \\epsilon = 3\\epsilon$\nCombining with Lemma 1, we get:\n$J_{exp}(\\hat{\\pi}) \\le \\kappa J_{alg}(\\hat{\\pi}) \\le 3\\kappa\\epsilon$\nApplying Theorem 1 from Ross et al. (2011), we get $J(\\hat{\\pi}) \\le J(\\pi^\\star) + 3u\\kappa\\epsilon T$.\nB IMPORTANCE OF BEHAVIOR CLONING COST\nThe following example shows how minimizing the uncertainty cost alone without the BC cost can lead to highly sub-optimal policies if the demonstration data is generated by a stochastic policy which is only slightly suboptimal. Consider the following deterministic chain MDP:\nThe agent always starts in s1, and gets a reward of 1 in s3 and 0 elsewhere.
The optimal policy is given by:\n\u03c0*(\u00b7|s0) = (0, 1), \u03c0*(\u00b7|s1) = (0, 1), \u03c0*(\u00b7|s2) = (0, 1), \u03c0*(\u00b7|s3) = (0, 1)\nAssume the demonstration data is generated by the following policy, which is only slightly suboptimal:\n\u03c0demo(\u00b7|s0) = (0, 1), \u03c0demo(\u00b7|s1) = (0, 1), \u03c0demo(\u00b7|s2) = (0.1, 0.9), \u03c0demo(\u00b7|s3) = (0, 1)\nLet us assume realizability and perfect optimization for simplicity. If both transitions (s2, a0) and (s2, a1) appear in the demonstration data, then Random Expert Distillation (RED) will assign zero cost to both transitions. If we do not use bootstrapped samples to train the ensemble, then DRIL without the BC cost (we will call this UO-DRIL for Uncertainty-Only DRIL) will also assign zero cost to both transitions since all models in the ensemble would recover the Bayes optimal solution given the demonstration data. If we are using bootstrapped samples, then the Bayes optimal solution for each bootstrapped sample may differ and thus the different policies in the ensemble might disagree in their predictions, although given enough demonstration data we would expect these differences (and thus the uncertainty cost) to be small.\nNote also that since no samples at the state s0 occur in the demonstration data, both RED and UO-DRIL will likely assign high uncertainty costs to state-action pairs at (s0, a0), (s0, a1) and thus avoid highly suboptimal policies which get stuck at s0.\nNow consider policies \u03c0\u03021, \u03c0\u03022 given by:\n\u03c0\u03021(\u00b7|s0) = (0, 1), \u03c0\u03021(\u00b7|s1) = (0, 1), \u03c0\u03021(\u00b7|s2) = (1, 0), \u03c0\u03021(\u00b7|s3) = (0, 1)\nand\n\u03c0\u03022(\u00b7|s0) = (0, 1), \u03c0\u03022(\u00b7|s1) = (0, 1), \u03c0\u03022(\u00b7|s2) = (0.2, 0.8), \u03c0\u03022(\u00b7|s3) = (0, 1)\nBoth of these policies only visit state-action pairs which are visited by the demonstration policy.
In the case described above, both RED and UO-DRIL will assign \u03c0\u03021 and \u03c0\u03022 similarly low costs. However, \u03c0\u03021 will cycle forever between s1 and s2, never collecting reward, while \u03c0\u03022 will with high probability reach s3 and stay there, thus achieving high reward. This shows that minimizing the uncertainty cost alone does not necessarily distinguish between good and bad policies. However, \u03c0\u03021 will incur a higher BC cost than \u03c0\u03022, since \u03c0\u03022 more closely matches the demonstration data at s2. This shows that including the BC cost can be important for further disambiguating between policies which all stay within the distribution of the demonstration data, but have different behavior within that distribution.\nC EXPERIMENTAL DETAILS\nC.1 ATARI ENVIRONMENTS\nAll behavior cloning models were trained to minimize the negative log-likelihood classification loss on the demonstration data for 500 epochs using Adam (Kingma & Ba, 2014) and a learning rate of 2.5 \u00b7 10^{\u22124}. We stopped training once the validation error did not improve for 20 epochs. For our method, we initially performed a hyperparameter search on Space Invaders over the values shown in Table 1.\nWe then chose the best values and kept those hyperparameters fixed for all other environments. All other A2C hyperparameters follow the default values in the repo (Kostrikov, 2018): policy networks consisted of 3-layer convolutional networks with 8-32-64 feature maps followed by a single-layer MLP with 512 hidden units.\nFor GAIL, we used the implementation in (Kostrikov, 2018) and replaced the MLP discriminator by a CNN discriminator with the same architecture as the policy network. We initially performed a hyperparameter search on Breakout with 10 demonstrations over the values shown in Table 2.
However, we did not find any hyperparameter configuration which performed better than behavioral cloning.\nC.2 CONTINUOUS CONTROL\nAll behavior cloning and ensemble models were trained to minimize the mean-squared error regression loss on the demonstration data for 500 epochs using Adam (Kingma & Ba, 2014) and a learning rate of 2.5 \u00b7 10^{\u22124}. Policy networks were 2-layer fully-connected MLPs with tanh activations and 64 hidden units.\nD ABLATION EXPERIMENTS\nIn this section we provide ablation experiments examining the effects of the cost function clipping and the role of the BC loss. We also compare the ensemble approach to a dropout-based approximation and show that DRIL works well in both cases.\nResults are shown in Figure 4. First, switching from the clipped cost in {\u22121,+1} to the raw cost causes a drop in performance. One explanation may be that since the raw costs are always positive (which corresponds to a reward which is always negative), the agent may learn to terminate the episode early in order to minimize the total cost incurred. Using a cost/reward which has both positive and negative values avoids this behavior.\nSecond, optimizing the pure BC cost performs better than the pure uncertainty cost for some environments (SpaceInvaders, BeamRider) while optimizing the pure uncertainty cost performs better than BC in Breakout. DRIL, which optimizes both, has robust performance and performs the best over all environments.\nFor the dropout approximation we trained a single policy network with a dropout rate of 0.1 applied to all layers except the last, and estimated the variance for each state-action pair using 5 different dropout masks. Similarly to the ensemble approach, we computed the 98th quantile of the variance on the demonstration data and used this value in our clipped cost.
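As a rough illustration of the clipped cost described above, the following Python sketch thresholds a variance estimate at a quantile of the variances observed on the demonstration data. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the nearest-rank quantile rule is an arbitrary choice, the sample numbers are made up, and the convention that pairs at or below the threshold receive cost -1 while pairs above it receive +1 is inferred from the description of a clipped cost taking values in {-1, +1}.

```python
import math
import statistics


def variance_of_action_probs(probs):
    # probs: probability assigned to one (state, action) pair by each
    # sampled policy (ensemble member or dropout mask).
    return statistics.pvariance(probs)


def nearest_rank_quantile(values, q):
    # Nearest-rank quantile for 0 < q <= 1 (assumed rule; other
    # interpolation schemes would serve the same purpose).
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def clipped_cost(variance, threshold):
    # Assumed convention: -1 when the variance is at or below the
    # threshold, +1 when it exceeds it.
    return -1.0 if variance <= threshold else 1.0


# Hypothetical variances measured on demonstration state-action pairs.
demo_variances = [0.001, 0.002, 0.004, 0.003, 0.05]
threshold = nearest_rank_quantile(demo_variances, 0.98)

# Five sampled policies that agree, and five that disagree.
agree = variance_of_action_probs([0.90, 0.91, 0.89, 0.90, 0.90])
disagree = variance_of_action_probs([0.10, 0.90, 0.50, 0.20, 0.80])
print(clipped_cost(agree, threshold), clipped_cost(disagree, threshold))
```

Because the cost takes both signs, an agent that stays where the sampled policies agree is rewarded rather than merely penalized less, which is the motivation given above for clipping.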
MC-dropout performs similarly to the ensembling approach, which shows that our method can be paired with different approaches to posterior estimation.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Mixture density networks , Authors: Christopher M Bishop , 1994\n2.  Title: Weight uncertainty in neural networks , Authors: Charles Blundell Julien Cornebise... , 2015\n3.  Title: A weakly informative default prior distribution for logistic and other regression models , Authors: Andrew Gelman Aleks Jakulin... , 2008\n4.  Title: Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper) , Authors: Andrew Gelman , 2006\n5.  Title: Variational dropout and the local reparameterization trick , Authors: Durk P Kingma Tim Salimans... , 2015\n6.  Title: Simple and scalable predictive uncertainty estimation using deep ensembles , Authors: Balaji Lakshminarayanan Alexander Pritzel... , 2017\n7.  Title: Variational dropout sparsifies deep neural networks , Authors: Dmitry Molchanov Arsenii Ashukha... , 2017\n8.  Title: Evidential deep learning to quantify classification uncertainty , Authors: Murat Sensoy Lance Kaplan... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust and efficient measures of uncertainty are crucial. While it is possible to train regression networks to output the parameters of a probability distribution by maximizing a Gaussian likelihood function, the resulting model remains oblivious to the underlying confidence of its predictions. In this paper, we propose a novel method for training deterministic NNs to not only estimate the desired target but also the associated evidence in support of that target. 
We accomplish this by placing evidential priors over our original Gaussian likelihood function and training our NN to infer the hyperparameters of our evidential distribution. We impose priors during training such that the model is penalized when its predicted evidence is not aligned with the correct output. Thus the model estimates not only the probabilistic mean and variance of our target but also the underlying uncertainty associated with each of those parameters. We observe that our evidential regression method learns well-calibrated measures of uncertainty on various benchmarks, scales to complex computer vision tasks, and is robust to adversarial input perturbations.", "Uncertainty estimation has a long history in neural networks, from modeling probability distribution parameters over outputs (Bishop, 1994) to Bayesian deep learning (Kendall & Gal, 2017). Our work builds on this foundation and presents a scalable representation for inferring the parameters of an evidential uncertainty distribution while simultaneously learning regression tasks via MLE. In Bayesian deep learning, priors are placed over network weights and estimated using variational inference (Kingma et al., 2015). Dropout (Gal & Ghahramani, 2016; Molchanov et al., 2017) and BBB (Blundell et al., 2015) rely on multiple samples to estimate predictive variance. Ensembles (Lakshminarayanan et al., 2017) provide a tangential approach where sampling occurs over multiple trained instances. In contrast, we place uncertainty priors over the likelihood function and thus only need a single forward pass to evaluate both prediction and uncertainty. Additionally, our approach of uncertainty estimation proved to be better calibrated and capable of predicting where the model fails. A large topic of research in Bayesian inference focuses on placing prior distributions over hierarchical models to estimate uncertainty (Gelman et al., 2006; 2008). 
Our methodology falls under the class of evidential deep learning which models higher-order distribution priors over neural network predictions to interpret uncertainty. Prior works in this field (Sensoy et al., 2018; Malinin & Gales, 2018) have focused exclusively on modeling uncertainty in the classification domain with Dirichlet prior distributions. Our work extends this field into the broad range of regression learning tasks (e.g. depth estimation, forecasting, robotic control learning, etc.) and demonstrates generalizability to out-of-distribution test samples and complex learning problems.", "This paper proposes a novel approach to estimate the confidence of predictions in a regression setting. The approach starts from the standard modelling assuming iid samples from a Gaussian distribution with unknown mean and variances and places evidential priors. This opens the door to online applications with fully integrated uncertainty estimates. \n\nPros:\n1.\tNovel approach to regression (a similar work has been published at NeurIPS last year for classification [3]), but the extension of the work to regression is important.\n2.\tThe experimental results show consistent improvement in performance over a wide base of benchmarks, scales to large vision problems and behaves robustly against adversarial examples.\n3.\tThe presentation of the paper is overall nice, and the Figures are very useful to the general comprehension of the article.\nCons:\n1.\tThe theory of evidence, which is not widely known in the ML community, is not clearly introduced. \nI think that the authors should consider adding a section similar to Section 3 of Sensoy et al. [3] should be considered. Currently, the only step explaining the evidential approach that I found was in section 3.1, in a very small paragraph (between \u201cthe mean of [\u2026] to \\lambda + 2\\alpha.\u201d). 
I believe that the article would greatly benefit from a more thorough introduction of concepts linked to the theory of evidence.\n2.\tThe authors briefly mention that KL is not well defined between some NIG distributions (p.5) and propose a custom evidence regularizer, but there\u2019s very little insight given on how this connects to/departs from the ELBO approach. \n\nOther comments/questions:\n1.\t(p.1)  I\u2019m not sure to fully understand what\u2019s meant by higher-order/lower-order distributions, could you clarify?\n2.\t(p.3) In section 3.1, the term in the total evidence \\phi_j is not defined.\n3.\t(p.3) Could you comment on the implications of assuming that the estimated distribution can be factorized? \n4.\t(p.4) Could you comment on the difference that there is between NLL_ML and NLL_SOS from a modelling perspective?\n5.\t(p.4) The ELBO loss (6) is unclearly defined, and not connected to the direct context. I would suggest moving this to the section 3.3, where the prior p(\\theta) used in eq. (6) is actually defined.\n\nSpelling & typos:\n-\t(p.4) There are several typos in equation (8), where tau should be replaced with 1/\\sigma^2. \n-\t(p.8) In the last sentence, there is \u201cntwork\u201d instead of network.\n-\t(p.9) There is a typo in the name of J\u00f8sang in the references. \n-\t(p.10) In equation (13), due to the change of variable, there should be a \n-(1/\\tau^2) added;  \n-\t(p.10) In equation (14), the \\exp(-\\lambda*\\pi*(\u2026)) should be replaced with \\exp(-\\lambda*\\tau*(\u2026)). \n\n6: Weak Accept"], "input": "2 MODELLING UNCERTAINTIES FROM DATA\n 2.1 PRELIMINARIES\nConsider the following supervised optimization problem: given a dataset, D, of N paired training examples, (x1, y1), . . . , (xN , yN ), we aim to learn a function f , parameterized by a set of weights, w, which approximately solves the following optimization problem:\nmin w\nJ(w); J(w) = 1\nN N\u2211 i=1 Li(w), (1)\nwhere Li(\u00b7) describes a loss function. 
In this work, we consider deterministic regression problems, which commonly optimize the sum of squared errors, $L_i(w) = \\frac{1}{2} \\|y_i - f(x_i; w)\\|^2$. In doing so, the model is encouraged to learn the average correct answer for a given input, but does not explicitly model any underlying noise or uncertainty in the data when making its estimation.\n2.2 MAXIMUM LIKELIHOOD ESTIMATION\nWe can also approach our optimization problem from a maximum likelihood perspective, where we learn model parameters that maximize the likelihood of observing a particular set of training data. In the context of deterministic regression, we assume our targets, yi, were drawn i.i.d. from a Gaussian distribution with mean and variance parameters \u03b8 = (\u00b5, \u03c3^2). In maximum likelihood estimation, we aim to learn a model to infer the \u03b8 = (\u00b5, \u03c3^2) that maximizes the likelihood of observing our targets, y, given by p(yi|\u03b8). In practice, we minimize the negative log likelihood by setting:\n$L_i(w) = -\\log p(y_i | \\underbrace{\\mu, \\sigma^2}_{\\theta}) = \\frac{1}{2} \\log(2\\pi\\sigma^2) + \\frac{(y_i - \\mu)^2}{2\\sigma^2}$. (2)\nIn learning the parameters \u03b8, this likelihood function allows us to successfully model the uncertainty of our data, also known as the aleatoric uncertainty. However, our model remains oblivious to the predictive model or epistemic uncertainty (Kendall & Gal, 2017).\nIn this paper, we present a novel approach for estimating the evidence in support of network predictions by directly learning both the inferred aleatoric uncertainty as well as the underlying epistemic uncertainty over its predictions. We achieve this by placing higher-order prior distributions over the learned parameters governing the distribution from which our observations are drawn.\n3 EVIDENTIAL UNCERTAINTY FOR REGRESSION\n 3.1 PROBLEM SETUP\nWe consider the problem where our observed targets, yi, are drawn i.i.d.
from a Gaussian distribution now with unknown mean and variance (\u00b5, \u03c3^2), which we seek to probabilistically estimate. We model this by placing a conjugate prior distribution on (\u00b5, \u03c3^2). If we assume our observations are drawn from a Gaussian, this leads to placing a Gaussian prior on our unknown mean and an Inverse-Gamma prior on our unknown variance:\n$(y_1, \\ldots, y_N) \\sim \\mathcal{N}(\\mu, \\sigma^2)$, $\\mu \\sim \\mathcal{N}(\\gamma, \\sigma^2 \\lambda^{-1})$, $\\sigma^2 \\sim \\Gamma^{-1}(\\alpha, \\beta)$,\nwhere \u0393(\u00b7) is the gamma function, m = (\u03b3, \u03bb, \u03b1, \u03b2), and \u03b3 \u2208 R, \u03bb > 0, \u03b1 > 0, \u03b2 > 0. Our aim is to estimate a posterior distribution q(\u00b5, \u03c3^2) = p(\u00b5, \u03c3^2|y1, . . . , yN ). To obtain an approximation for the true posterior, we assume that the estimated distribution can be factorized (Parisi, 1988) such that q(\u00b5, \u03c3^2) = q(\u00b5) q(\u03c3^2). Thus, our approximation takes the form of the Gaussian conjugate prior, the Normal Inverse-Gamma (N.I.G.) distribution:\n$p(\\underbrace{\\mu, \\sigma^2}_{\\theta} | \\underbrace{\\gamma, \\lambda, \\alpha, \\beta}_{m}) = \\frac{\\beta^\\alpha \\sqrt{\\lambda}}{\\Gamma(\\alpha) \\sqrt{2\\pi\\sigma^2}} \\left( \\frac{1}{\\sigma^2} \\right)^{\\alpha+1} \\exp\\left\\{ -\\frac{2\\beta + \\lambda(\\gamma - \\mu)^2}{2\\sigma^2} \\right\\}$. (3)\nA popular interpretation of the parameters of the conjugate prior distribution is in terms of \u201cvirtual-observations\u201d in support of a given property (Jordan, 2009). For example, the mean of a N.I.G. distribution can be interpreted as being estimated from \u03bb virtual-observations with sample mean \u03b3 while its variance was estimated from 2\u03b1 virtual-observations with sample mean \u03b3 and sum of squared deviations 2\u03b2.
Following from this interpretation, we define the total evidence, \u03a6, of our evidential distributions as the sum of all inferred virtual-observations counts: (\u03a6 = \u03bb+ 2\u03b1).\nDrawing a sample \u03b8j from the N.I.G. distribution yields a single instance of our likelihood function, namely N (\u00b5j , \u03c32j ). Thus, the N.I.G. hyperparameters, (\u03b3, \u03bb, \u03b1, \u03b2), determine not only the location but also the dispersion concentrations, or uncertainty, associated with our inferred likelihood function. Therefore, we can interpret the N.I.G. distribution as higher-order, evidential, distribution on top of the unknown lower-order likelihood distribution from which observations are drawn.\nFor example, in Fig. 2A we visualize different evidential N.I.G. distributions with varying model parameters. We illustrate that by increasing the evidential parameters (i.e. \u03bb, \u03b1) of this distribution, the p.d.f. becomes tightly concentrated about its inferred likelihood function. Considering a single parameter realization of this higher-order distribution, cf. Fig. 2B, we can subsequently sample many lower-order realizations of our likelihood function, as shown in Fig. 2C.\nIn this work, we use neural networks to infer the hyperparameters of this higher-order, evidential distribution, given an input. This approach presents several distinct advantages compared to prior work. First, our method enables simultaneous learning of the desired regression task, along with aleatoric and epistemic uncertainty estimation, built in, by enforcing evidential priors. Second, since the evidential prior is a higher-order N.I.G. distribution, the maximum likelihood Gaussian can be computed analytically from the expected values of the (\u00b5, \u03c32) parameters, without the need for sampling. 
Third, we can effectively estimate the epistemic or model uncertainty associated with the network\u2019s prediction by simply evaluating the variance of our inferred evidential distribution.\n3.2 LEARNING THE EVIDENTIAL DISTRIBUTION\nHaving formalized the use of an evidential distribution to capture both aleatoric and epistemic uncertainty, we next describe our approach for learning a model (cf. Fig. 2D) to output the hyperparameters of this distribution. For clarity, we will structure the learning objective into two distinct parts: (1) acquiring or maximizing model evidence in support of our observations and (2) minimizing evidence or inflating uncertainty when the prediction is wrong. At a high level, we can think of (1) as a way of fitting our data to the evidential model while (2) enforces a prior to inflate our uncertainty estimates.\n(1) Maximizing the model fit.\nFrom Bayesian probability theory, the \u201cmodel evidence\u201d, or marginal likelihood, is defined as the likelihood of an observation, yi, given the evidential distribution parameters m and is computed by marginalizing over the likelihood parameters \u03b8:\n$p(y_i | m) = \\frac{p(y_i | \\theta, m) \\, p(\\theta | m)}{p(\\theta | y_i, m)} = \\int_\\theta p(y_i | \\theta, m) \\, p(\\theta | m) \\, d\\theta$. (4)\nThe model evidence is not, in general, straightforward to evaluate since computing it involves integrating out the dependence on latent model parameters:\n$p(y_i | m) = \\int_{\\sigma^2 = 0}^{\\infty} \\int_{\\mu = -\\infty}^{\\infty} p(y_i | \\mu, \\sigma^2) \\, p(\\mu, \\sigma^2 | m) \\, d\\mu \\, d\\sigma^2$ (5)\nHowever, by placing a N.I.G. evidential prior on our Gaussian likelihood function an analytical solution for the model evidence does exist. For computational reasons, we minimize the negative logarithm of the model evidence ($L^{NLL}_i(w)$). For a complete derivation please refer to Sec.
7.1,\n$L^{NLL}_i(w) = -\\log p(y_i | m) = -\\log\\left( 2^{\\frac{1}{2}+\\alpha} \\beta^\\alpha \\sqrt{\\frac{\\lambda}{2\\pi(1+\\lambda)}} \\left( 2\\beta + \\frac{\\lambda(\\gamma - y_i)^2}{1+\\lambda} \\right)^{-\\frac{1}{2}-\\alpha} \\right)$. (6)\nInstead of modeling this loss using empirical Bayes, where the objective is to maximize model evidence, we alternatively can minimize the sum-of-squared (SOS) errors, between the evidential prior and the data that would be sampled from the associated likelihood. Thus, we define $L^{SOS}_i(w)$ as\n$L^{SOS}_i(w) = \\mathbb{E}_{\\theta' \\sim p(\\theta|m)}\\left[ \\mathbb{E}_{y' \\sim p(y|\\theta')}\\left[ \\|y' - y_i\\|_2^2 \\right] \\right]$ (7)\n$= \\int_{\\sigma^2 = 0}^{\\infty} \\int_{\\mu = -\\infty}^{\\infty} \\mathbb{E}_{y' \\sim p(y|\\mu, \\sigma^2)}\\left[ \\|y' - y_i\\|_2^2 \\right] p(\\mu, \\sigma^2 | m) \\, d\\mu \\, d\\sigma^2$ (8)\n$= \\left( \\frac{\\Gamma(\\alpha - \\frac{1}{2})}{4 \\, \\Gamma(\\alpha) \\lambda \\sqrt{\\beta}} \\right) \\left( 2\\beta(1+\\lambda) + (2\\alpha - 1)\\lambda(y_i - \\gamma)^2 \\right)$. (9)\nA step-by-step derivation is given in Sec. 7.1. In our experiments, using $L^{SOS}_i(w)$ resulted in greater training stability and increased performance, compared to the $L^{NLL}_i(w)$ loss. Therefore, $L^{SOS}_i(w)$ is used in all presented results.\n(2) Minimizing evidence on errors.\nIn the first term of our objective above, we outlined a loss function for training a NN to output parameters of a N.I.G. distribution to fit our observations, either by maximizing the model evidence or minimizing the sum-of-squared errors. Now, we describe how to regularize training by applying a lack of evidence prior (i.e., maximum uncertainty). Therefore, during training we aim to minimize our evidence (or maximize our uncertainty) everywhere except where we have training data.\nThis can be done by minimizing the KL-divergence between the inferred posterior, q(\u03b8), and a prior, p(\u03b8). This has been demonstrated with success in the categorical setting where the uncertainty prior can be set to a uniform Dirichlet (Malinin & Gales, 2018; Sensoy et al., 2018).
In the regression setting, the KL-divergence between our posterior and a N.I.G. zero evidence prior (i.e., {\u03b1, \u03bb} = 0) is not well defined (Soch & Allefeld, 2016); please refer to Sec. 7.2 for a derivation. Furthermore, this prior needs to be enforced specifically where there is no support from the data. Past works in classification accomplish this by using the ground truth likelihood classification (i.e., the one-hot encoded labels) to remove the non-misleading evidence. However, in regression, labels are provided as point targets (not ground truth Gaussian likelihoods). Unlike classification, it is not possible to penalize evidence everywhere except our single point estimate, as this space is infinite and unbounded. Thus, these previously explored approaches for evidential optimization are not directly applicable.\nTo address both of these shortcomings of past works, now in the regression setting, we formulate a novel evidence regularizer, $L^R_i$, based on the error of the i-th prediction,\n$L^R_i(w) = \\|y_i - \\mathbb{E}[\\mu_i]\\|_p \\cdot \\Phi = \\|y_i - \\gamma\\|_p \\cdot (2\\alpha + \\lambda)$, (10)\nwhere $\\|x\\|_p$ represents the L-p norm of x. The value of p impacts the penalty imposed on the evidence when a wrong prediction is made. For example, p = 2 heavily over-penalizes the evidence on larger errors, whereas p = 1 and p = 0.5 saturate the evidence penalty for larger errors. We found that p = 1 provided the optimal stability during training and use this value in all presented results.\nThis regularization loss imposes a penalty whenever there is an error in the prediction that scales with the total evidence of our inferred posterior. Conversely, large amounts of predicted evidence will not be penalized as long as the prediction is close to the target observation.
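To make the regularizer in Eq. (10) concrete, here is a small Python sketch for a scalar target with p = 1, the value used in all presented results. The function names and the sample numbers are hypothetical and chosen only for illustration; this is a sketch under those assumptions rather than the authors' implementation.

```python
def total_evidence(lam, alpha):
    # Total evidence: lambda + 2 * alpha, the sum of the
    # virtual-observation counts.
    return lam + 2.0 * alpha


def evidence_regularizer(y, gamma, lam, alpha):
    # Eq. (10) with p = 1 and a scalar target:
    # |y - gamma| * (2 * alpha + lambda).
    return abs(y - gamma) * total_evidence(lam, alpha)


# Same prediction error, different amounts of claimed evidence.
low_evidence = evidence_regularizer(y=3.0, gamma=1.0, lam=1.0, alpha=2.0)
high_evidence = evidence_regularizer(y=3.0, gamma=1.0, lam=4.0, alpha=8.0)
# No prediction error: the claimed evidence is not penalized.
exact = evidence_regularizer(y=1.0, gamma=1.0, lam=4.0, alpha=8.0)
print(low_evidence, high_evidence, exact)
```

For a fixed error the penalty grows linearly with the predicted evidence, and it vanishes when the prediction matches the target, which is the behaviour described above.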
We provide an ablation analysis to quantitatively demonstrate the added value of this evidential regularizer in Sec 7.3.2.\nThe combined loss function employed during training consists of the two loss terms for maximizing model evidence and regularizing evidence,\nLi(w) = LSOSi (w) + LRi (w). (11) 3.3 EVALUATING ALEATORIC AND EPISTEMIC UNCERTAINTY\nThe aleatoric uncertainty, also referred to as statistical or data uncertainty, is representative of unknowns that differ each time we run the same experiment. We evaluate the aleatoric uncertainty from E[\u03c3^2] = \u03b2/(\u03b1 \u2212 1). The epistemic uncertainty, also known as the model uncertainty, describes the estimated uncertainty in the learned model and is defined as Var[\u00b5] = \u03b2/((\u03b1 \u2212 1)\u03bb). Note that Var[\u00b5] = E[\u03c3^2]/\u03bb, which is expected as \u03bb is one of our two evidential virtual-observation counts. 4 EXPERIMENTS\n 4.1 PREDICTIVE ACCURACY AND UNCERTAINTY BENCHMARKING\nWe first qualitatively compare the performance of our approach against a set of benchmarks on a one-dimensional toy regression dataset (Fig. 3). For training and dataset details please refer to Sec. 7.3.1. We compare deterministic regression, as well as techniques using empirical variance of the networks\u2019 predictions such as MC-dropout, model-ensembles, and Bayes-by-Backprop, which underestimate the uncertainty outside the training distribution. In contrast, evidential regression estimates uncertainty appropriately and grows the uncertainty estimate with increasing distance from the training data.\nAdditionally, we compare our approach to state-of-the-art methods for predictive uncertainty estimation using NNs on common real-world datasets used in (Hern\u00e1ndez-Lobato & Adams, 2015; Lakshminarayanan et al., 2017; Gal & Ghahramani, 2016). We evaluate our proposed evidential regression method against model-ensembles and BBB based on root mean squared error (RMSE), and negative log-likelihood (NLL). 
We do not provide results for MC-dropout since it consistently performed inferior to the other baselines. The results in Table 1 indicate that although the loss function for evidential regression is more complex than competing approaches, it is the top performer in RMSE and NLL in 8 out of 9 datasets.\nFurthermore, we demonstrate that, on a synthetic dataset with a priori known noise, evidential models can additionally estimate and recover the underlying aleatoric uncertainty. For more information please refer to Sec. 7.3.3 for results and experiment details. 4.2 DEPTH ESTIMATION\nAfter establishing benchmark comparison results, in this subsection we demonstrate the scalability of our evidential learning by extending to the complex, high-dimensional task of depth estimation. Monocular end-to-end depth estimation is a central problem in computer vision which aims to learn a representation of depth directly from an RGB image of the scene. This is a challenging learning task since the output target y is very high-dimensional. For every pixel in the image, we regress over the desired depth and simultaneously estimate the uncertainty associated to that individual pixel.\nOur training data consists of over 27k RGB-to-depth pairs of indoor scenes (e.g. kitchen, bedroom, etc.) from the NYU Depth v2 dataset (Nathan Silberman & Fergus, 2012). We train a U-Net style NN (Ronneberger et al., 2015) for inference. The final layer of our model outputs a single H \u00d7W activation map in the case of deterministic regression, dropout, ensembling and BBB. Evidential models output four final activation maps, corresponding to (\u03b3, \u03bb, \u03b1, \u03b2).\nTable 2 summarizes the size and speed of all models. Evidential models contain significantly fewer trainable parameters than ensembles (where the number of parameters scales linearly with the size of the ensemble). 
BBB maintains a trainable mean and variance for every weight in the network, so its size is roughly 2\u00d7 larger as well. Since evidential regression models do not require sampling in order to estimate their uncertainty, their forward-pass inference times are also significantly more efficient. Finally, we demonstrate comparable predictive accuracy (through RMSE and NLL) to the other models. For a more detailed breakdown of how the number of samples affects the baselines please refer to Tab. 3. Note that the output size of the depth estimation problem presented significant learning challenges for the BBB baseline, and it was unable to converge during training. As a result, for the remainder of this analysis we compare against only spatial dropout and ensembles.\nWe evaluate these models in terms of their accuracy and their predictive uncertainty on unseen test data. Fig. 4A-C visualizes the predicted depth, absolute error from ground truth, and predictive uncertainty across three randomly picked test images. Ideally, a strong predictive uncertainty would capture any errors in the prediction (i.e., roughly correspond to where the model is making errors). Compared to dropout and ensembling, evidential uncertainty modeling captures the depth errors while providing clear and localized predictions of confidence. In general, dropout drastically underestimates the amount of uncertainty present, while ensembling occasionally overestimates the uncertainty.\nTo evaluate uncertainty calibration to the ground-truth errors, we fit receiver operating characteristic (ROC) curves to normalized estimates of error and uncertainty. Thus, we test the network\u2019s ability to detect how likely it is to make an error at a given pixel using its predictive uncertainty.\nROC curves take into account sensitivity and specificity of the uncertainties towards error predictions and are stronger if they contain greater area under their curve (AUC). Fig. 
4D demonstrates that our evidential model provides uncertainty estimates that concentrate where the model is making errors.\nIn addition to epistemic uncertainty, we also evaluate the aleatoric uncertainty estimates learned by our evidential models. Fig. 5 compares the evidential aleatoric uncertainty to that obtained by Gaussian likelihood optimization in several domains with high data uncertainty (mirror reflections and poor illumination). The results of both methods are in strong agreement, identifying mirror reflections and dark regions without visible geometry as sources of high uncertainty. 4.3 OUT-OF-DISTRIBUTION TESTING\nA key use of uncertainty estimation is to understand when a model is faced with test samples that fall out-of-distribution (OOD) or when the model\u2019s output cannot be trusted. In the previous subsection, we showed that our evidential uncertainties were well calibrated with the model\u2019s errors. In this subsection, we investigate the performance on out-of-distribution samples. Fig. 6 illustrates predicted depth on test input images from within (left) and outside (right) the original distribution. None of the images were seen by the model during training. We qualitatively and quantitatively demonstrate that the epistemic uncertainty predicted by our evidential model consistently increases on the OOD samples. 4.3.1 ROBUSTNESS TO ADVERSARIAL SAMPLES\nNext, we consider the extreme case of OOD detection where the inputs are adversarially perturbed to inflict maximum error on the model. We compute adversarial perturbations to our test set using the fast gradient sign method (Goodfellow et al., 2014), with increasing scales, \u03b5, of noise. Fig. 7A confirms that the absolute error of all methods increases as adversarial noise is added. We also observe a positive effect of noise on our predictive uncertainty estimates in Fig. 7B. 
An additional desirable property of evidential uncertainty modeling is that it reports a higher overall uncertainty when presented with adversarial inputs compared to dropout and ensembling methods. Furthermore, we observe this strong overall uncertainty estimation despite the model losing calibration accuracy from the adversarial examples (Fig. 7C).\nThe robustness of evidential uncertainty against adversarial perturbations is visualized in greater detail in Fig. 7D, which illustrates the predicted depth, error, and estimated pixel-wise uncertainty as we perturb the input image with greater amounts of noise (left-to-right). Note that the predictive uncertainty not only steadily increases as we increase the noise, but the spatial concentrations of uncertainty throughout the image also maintain tight correspondence with the error. 6 CONCLUSION\nIn this paper, we develop a novel method for training deterministic NNs that both estimates a desired target and evaluates the evidence in support of the target to generate robust metrics of model uncertainty. We formalize this in terms of learning evidential distributions, and achieve stable training by penalizing our model for prediction errors that scale with the available evidence. Our approach for evidential regression is validated on a benchmark regression task. We further demonstrate that this method robustly scales to a key task in computer vision, depth estimation, and that the predictive uncertainty increases with increasing out-of-distribution adversarial perturbation. This framework for evidential representation learning provides a means to achieve the precise uncertainty metrics required for robust neural network deployment in safety-critical domains. 7 APPENDIX\n 7.1 MODEL EVIDENCE DERIVATIONS\nFor convenience, let \u03c4 = 1/\u03c3^2 be the precision of a Gaussian distribution. 
The change of variables transforms the Normal Inverse-Gamma distribution p(\u00b5, \u03c32|\u03b3, \u03bb, \u03b1, \u03b2) to the equivalent Normal Gamma distribution p(\u00b5, \u03c4 |\u03b3, \u03bb, \u03b1, \u03b2), parameterized by precision \u03c4 \u2208 (0,\u221e) instead of variance \u03c32,\np(\u00b5, \u03c4 |\u03b3, \u03bb, \u03b1, \u03b2) = \u03b2 \u03b1 \u221a \u03bb\n\u0393(\u03b1) \u221a 2\u03c0 \u03c4\u03b1\u2212 1 2 e\u2212\u03b2\u03c4e\u2212\n\u03bb\u03c4(\u00b5\u2212\u03b3)2 2 . (12) 7.1.1 TYPE II MAXIMUM LIKELIHOOD LOSS\nMarginalizing out \u00b5 and \u03c4 gives the result of equation 5, p(yi|m) = \u222b \u03c4 \u222b \u00b5 p(yi|\u00b5, \u03c4) p(\u00b5, \u03c4 |\u03b3, \u03bb, \u03b1, \u03b2) d\u00b5d\u03c4 (13)\n= \u222b \u221e \u03c4=0 \u222b \u221e \u00b5=\u2212\u221e [\u221a \u03c4 2\u03c0 e\u2212 \u03c4 2 (yi\u2212\u00b5) 2 ] [ \u03b2\u03b1 \u221a \u03bb \u0393(\u03b1) \u221a 2\u03c0 \u03c4\u03b1\u2212 1 2 e\u2212\u03b2\u03c4e\u2212 \u03bb\u03c4(\u00b5\u2212\u03b3)2 2 ] d\u00b5d\u03c4 (14)\n= \u222b \u221e \u03c4=0 (\u03b2\u03c4)\u03b1 \u0393(\u03b1) \u221a \u03bb 2\u03c0\u03c4(1 + \u03bb) e\u2212\u03b2\u03c4e\u2212 \u03c4\u03bb(\u03b3\u2212yi) 2 2(1\u2212\u03bb) d\u03c4 (15)\n= 2 1 2+\u03b1\u03b2\u03b1\n\u221a \u03bb\n2\u03c0(1 + \u03bb)\n( 2\u03b2 + \u03bb(\u03b3 \u2212 yi)2\n1 + \u03bb\n)\u2212 12\u2212\u03b1 . 
(16)\nFor computational reasons it is common to instead minimize the negative logarithm of the model evidence.\nLNLLi (w) = \u2212 log p(yi|m) = \u2212 log\n( 2 1 2+\u03b1\u03b2\u03b1 \u221a \u03bb\n2\u03c0(1 + \u03bb)\n( 2\u03b2 + \u03bb(\u03b3 \u2212 yi)2\n1 + \u03bb\n)\u2212 12\u2212\u03b1) (17) 7.1.2 SUM OF SQUARES LOSS\nSimilarly, we can marignalize out \u00b5 and \u03c32 to receive the result of equation 8, LSOSi (w) = \u222b \u03c32 \u222b \u00b5 Ey\u223cp(y|\u00b5,\u03c32) [ ||yi \u2212 y||22 ] p(\u00b5, \u03c32|\u03b3, \u03bb, \u03b1, \u03b2) d\u00b5d\u03c32 (18)\n= \u222b \u03c32 \u222b \u00b5 \u222b y ||yi \u2212 y||22 p(y|\u00b5, \u03c32) p(\u00b5, \u03c32|\u03b3, \u03bb, \u03b1, \u03b2) dy d\u00b5d\u03c32 (19)\n= \u222b \u221e \u03c32=0 \u222b \u221e \u00b5=\u2212\u221e \u222b \u221e y=\u2212\u221e ||yi \u2212 y||22 [\u221a 1 2\u03c0\u03c32 e\u2212 (y\u2212\u00b5)2 2\u03c32 ] [\n\u03b2\u03b1 \u221a \u03bb\n\u0393(\u03b1) \u221a 2\u03c0\n( 1\n\u03c32\n)\u03b1+ 32 e\u2212 \u03b2 \u03c32 e\u2212 \u03bb(\u00b5\u2212\u03b3)2 2\u03c32 ] dy d\u00b5d\u03c32 (20)\n= \u222b \u221e \u03c32=0 \u222b \u221e \u00b5=\u2212\u221e [ (yi \u2212 \u00b5)2 + \u03c32 ] [ \u03b2\u03b1\u221a\u03bb \u0393(\u03b1) \u221a 2\u03c0 ( 1 \u03c32 )\u03b1+ 32 e\u2212 \u03b2 \u03c32 e\u2212 \u03bb(\u00b5\u2212\u03b3)2 2\u03c32 ] d\u00b5d\u03c32\n(21)\n= \u222b \u221e \u03c32=0 \u03b2\u03b1 \u03bb\u0393(\u03b1) exp ( \u2212\u03b2/\u03c32 ) \u03c3\u22122(\u03b1+1) ( \u03c32(1 + \u03bb) + \u03bb(yi \u2212 \u03b3)2 ) d\u03c32 (22)\n=\n( \u0393(\u03b1\u2212 12 )\n4 \u0393(\u03b1)\u03bb \u221a \u03b2\n)( 2\u03b2(1 + \u03bb) + (2\u03b1\u2212 1)\u03bb(yi \u2212 \u03b3)2 ) (23) 7.2 KL-DIVERGENCE OF THE NORMAL INVERSE-GAMMA\nThe KL-divergence between two Normal Inverse-Gamma functions is given by (Soch & Allefeld, 2016):\nKL(p(\u00b5, \u03c32|\u03b31, \u03bb1, \u03b11, \u03b21)||p(\u00b5, \u03c32|\u03b32, \u03bb2, \u03b12, \u03b22) (24)\n= 1\n2 \u03b11 \u03b21 (\u00b51 \u2212 \u00b52)2\u03bb2 + 
(1/2)(\u03bb2/\u03bb1) \u2212 1/2 + \u03b12 log(\u03b21/\u03b22) \u2212 log(\u0393(\u03b11)/\u0393(\u03b12)) (25)\n+ (\u03b11 \u2212 \u03b12)\u03a8(\u03b11) \u2212 (\u03b21 \u2212 \u03b22)(\u03b11/\u03b21) (26)\n\u0393(\u00b7) is the Gamma function and \u03a8(\u00b7) is the Digamma function. The evidence is defined by (2\u03b1+ \u03bb). For zero evidence, both \u03b1 = 0 and \u03bb = 0. To compute the KL divergence between one N.I.G. distribution and another with zero evidence we can set either {\u03b12, \u03bb2} = 0 (i.e., forward-KL), in which case \u0393(0) is not well defined, or {\u03b11, \u03bb1} = 0 (i.e., reverse-KL), which causes a division by zero in the \u03bb2/\u03bb1 term. In either approach, the KL-divergence between an arbitrary N.I.G. and one with zero evidence cannot be evaluated. 7.3 BENCHMARK REGRESSION TASK EVALUATIONS\n 7.3.1 EPISTEMIC UNCERTAINTY ESTIMATION\nThe training set consists of training examples drawn from y = sin(3x)/(3x) + \u03b5, where \u03b5 \u223c N(0, 0.02) in the region \u22123 \u2264 x \u2264 3, whereas the test data is unbounded. All models consisted of 100 neurons with 3 hidden layers and were trained to convergence. The data presented in Fig. 3 illustrates the estimated epistemic uncertainty and predicted mean across the entire test set, \u22123 \u2264 x \u2264 3. 7.3.2 IMPACT OF THE EVIDENTIAL REGULARIZER\nIn the following experiment, we demonstrate the importance of augmenting the training objective with our evidential regularizer LR as introduced in Sec. 3.2. Fig. 8 provides quantitative results on training the same regression problem presented in 7.3.1 with and without this evidential regularization term. This term introduces an \u201cuncertain\u201d prior into our learning process so out-of-distribution (OOD) samples exhibit high epistemic uncertainty. Without the use of this novel loss term, the learned epistemic uncertainty is unreliable on OOD data. 
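For reference, the epistemic uncertainty discussed here is Var[mu] = beta / ((alpha - 1) * lambda) from Sec. 3.3, and the aleatoric uncertainty is E[sigma^2] = beta / (alpha - 1). The following is a minimal sketch of how both could be read off the predicted N.I.G. parameters; the function name and the example parameter values are hypothetical and are not taken from the experiments.

```python
def nig_uncertainties(lam, alpha, beta):
    # Both moments are finite only for alpha > 1 and lam > 0.
    if alpha <= 1.0 or lam <= 0.0:
        raise ValueError('requires alpha > 1 and lam > 0')
    aleatoric = beta / (alpha - 1.0)   # E[sigma^2]
    epistemic = aleatoric / lam        # Var[mu] = E[sigma^2] / lam
    return aleatoric, epistemic


# Hypothetical outputs: many virtual observations versus very few.
in_dist = nig_uncertainties(lam=10.0, alpha=3.0, beta=1.0)
ood = nig_uncertainties(lam=0.25, alpha=3.0, beta=1.0)
print(in_dist, ood)
```

In this toy comparison the aleatoric term is identical in both cases and only the epistemic term grows as lambda shrinks, which is the behaviour the regularizer is meant to encourage away from the training data.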
7.3.3 ALEATORIC UNCERTAINTY ESTIMATION\nThe training set consists of training examples drawn from y = sin(3x)/(3x) + \u03b5(x), where \u03b5(x) \u223c N(0, s(x)), and s(x) = 120 cos(3.3x) + 0.1. We evaluate against Kendall & Gal (2017), which presents an algorithm for heteroscedastic aleatoric uncertainty estimation by inferring the mean and variance of a Gaussian likelihood function. As presented in the paper, training is done by minimizing the negative log-likelihood of the data given the inferred likelihood parameters. Both our network and the baseline Gaussian NLL network consisted of 100 neurons with 3 hidden layers and were trained to convergence. 7.4 ADDITIONAL DEPTH ESTIMATION PERFORMANCE RESULTS\n", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Learning cooperative visual dialog agents with deep reinforcement learning , Authors: Abhishek Das Satwik Kottur... , 2017\n2.  Title: Guesswhat?! visual object discovery through multi-modal dialogue , Authors: Harm de Vries Florian Strub... , 2017\n3.  Title: Criteria for human-compatible ai in two-player vision-language tasks , Authors: Cheolho Han Sang-Woo Lee... , 2017\n4.  Title: Two can play this game: Visual dialog with discriminative question generation and answering , Authors: Unnat Jain Svetlana Lazebnik... , 2018\n5.  Title: Nsml: Meet the mlaas platform with a real-world case study , Authors: Hanjoo Kim Minkyu Kim... , 2018\n6.  Title: Codraw: Visual dialog for collaborative drawing , Authors: Jin-Hwa Kim Devi Parikh... , 2017\n7.  Title: Answerer in questioner\u2019s mind for goal-oriented visual dialogue , Authors: Sang-Woo Lee Yu-Jung Heo... , 2018\n8.  Title: Microsoft coco: Common objects in context , Authors: Tsung-Yi Lin Michael Maire... , 2014\n9.  Title: Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model , Authors: Jiasen Lu Anitha Kannan... , 2017\n10.  Title: Query-reduction networks for question answering , Authors: Minjoon Seo Sewon Min... , 2017\n11.  Title: Neural speed reading via skim-rnn , Authors: Minjoon Seo Sewon Min... , 2018\n12.  Title: Visual reference resolution using attention memory for visual dialog , Authors: Paul Hongsuck Seo Andreas Lehrmann... , 2017\n13.  Title: Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog , Authors: Jiaping Zhang Tiancheng Zhao... , 2018\n14.  Title: Goal-oriented visual question generation via intermediate rewards , Authors: Junjie Zhang Qi Wu... 
, 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Answerer in Questioner\u2019s Mind (AQM) is an information-theoretic framework that has been recently proposed for task-oriented dialog systems. AQM benefits from asking a question that would maximize the information gain when it is asked. However, due to its intrinsic nature of explicitly calculating the information gain, AQM has a limitation when the solution space is very large. To address this, we propose AQM+ that can deal with a large-scale problem and ask a question that is more coherent to the current context of the dialog. We evaluate our method on GuessWhich, a challenging task-oriented visual dialog problem, where the number of candidate classes is near 10K. Our experimental results and ablation studies show that AQM+ outperforms the state-of-the-art models by a remarkable margin with a reasonable approximation. In particular, the proposed AQM+ reduces more than 60% of error as the dialog proceeds, while the comparative algorithms diminish the error by less than 6%. Based on our results, we argue that AQM+ is a general task-oriented dialog algorithm that can be applied for non-yes-or-no responses.", "Task-oriented visual dialog has recently received attention in the fields of computer vision and natural language processing (Kim et al., 2017). GuessWhat is one of the well-known task-oriented dialog tasks, where the goal is to figure out, through a dialog, a target object in the image that the answerer has in mind (de Vries et al., 2017). However, GuessWhat is a relatively easy task because it only allows answers of the form yes or no. The baseline visual question answering (VQA) model achieves 78.5%. 
In the object guessing task (i.e., GuessWhat task itself), the state-of-the-art averaged accuracy of SL, RL (Zhang et al., 2018b), and AQM (Lee et al., 2018) reached 44.6%, 60.8%, and 72.9% at the 5th round, respectively. The random guessing baseline has an accuracy of 16.0% (Han et al., 2017); thus RL algorithms achieve a 53.3% error decrease, whereas AQM achieves 67.7%. GuessWhich is a cooperative two-player game in which one player tries to figure out an image out of 9,628 that another has in mind (Das et al., 2017b). GuessWhich uses the Visual Dialog dataset (Das et al., 2017a), which includes human dialogs on MSCOCO images (Lin et al., 2014) as well as the captions that are generated. Although GuessWhich is similar to GuessWhat, it is more challenging in every task including asking a question, giving an answer, and guessing the target class. For example, unlike GuessWhat, where questions can be answered with yes or no, the answer can be an arbitrary sentence in GuessWhich. Therefore, the VQA task in the Visual Dialog dataset is more widely studied than the GuessWhat dataset (Lu et al., 2017; Seo et al., 2017b). Similar to GuessWhat, SL and RL approaches have been applied to solve the GuessWhich task and they showed a moderate increase in performance (Das et al., 2017b; Jain et al., 2018; Zhang et al., 2018a). However, based on the authors\u2019 recent GitHub implementation1 of the papers in ICCV (Das et al., 2017b), SL and RL methods diminish only 6% of the error through the dialog compared to the zeroth turn baselines, which only use the generated caption.", "The paper proposes an improvement over the AQM approach for an information-theoretic framework for task-oriented dialog systems. Specifically, the paper tries to circumvent the problem of explicitly calculating the information gain while asking a question in the AQM setting. 
While the original AQM approach sweeps over all possible guesses and answers while estimating information gain, this is rendered impractical in scenarios where this space cannot be tractably enumerated. As a solution, AQM+ proposes sweeping over only some top-k relevant instantiations of answers and guesses in this space by normalizing the probabilities of the subset of the space in consideration. \n\nComments:\n- The paper is overloaded with notations and the writing is not very smooth. The terse nature of the content makes it hard to follow in general. If someone apriori was not familiar with task-oriented dialog or the visual dialog setting in Das et al. (2017b), it would be quite hard to follow.\n- While mentioning SL/RL approaches while comparing or introducing the setup, the authors do not make any distinction between discriminative and generative dialog models. Specifically, SL approaches could either be trained discriminatively to rank options among the provided ones given dialog context or in a generative manner via token-level teacher forcing. The authors should clearly make this distinction in the introduction and in other places where it\u2019s needed.\n- The authors should stress more upon the approximations involved while calculating mutual information. As far as I understand, even in the AQM approach the numerator and the denominator within the logarithm are estimated from a different set of parameters and as such they need not be consistent with each other under marginalization. The term resembles MI and ensuring consistency in such a framework would require either of the numerator or the denominator to be close to something like a variational approximation of the true distribution. In addition, AQM+ adopts the same framework as AQM but computes MI over some top-k of the random variables being considered. Could the authors comment more on why restricting the space of r.v.\u2019s to some top-k samples is a good idea? 
Would that not lead to somewhat of a biased estimator?\n- Unless I am missing something, training aprxAgen from the training data (indA) seems odd. Assuming, this to be Qbot\u2019s mental model of Abot -- there is no prior reason why this should be initialized or trained in such a manner. Similarly, the training paradigm of the depA setting is confusing. If they are trained in a manner similar to a regular Abot -- either SL or RL -- then they\u2019re not approximate mental models but are rather just another Abot agent in play which is being queried by\n\n6: Marginally above acceptance threshold"], "input": "1 INTRODUCTION\nRecent advances in deep learning have led an end-to-end neural approach to task-oriented dialog problems that can reduce a laborious labeling task on states and intents (Bordes & Weston, 2017). Many researchers have applied sequence-to-sequence models (Vinyals & Le, 2015) that are trained in a supervised learning (SL) and a reinforcement learning (RL) fashion to generate an appropriate sentence for the task. In SL approaches, given the dialog histories so far, the model predicts the distribution of the responses from the task-oriented system (Eric & Manning, 2017; de Vries et al., 2017; Zhao et al., 2018). However, the SL approach typically requires a lot of training data to deal with unseen scenarios and cover all trajectories of the vast action space of dialog systems (Wen et al., 2016). Furthermore, because the SL-based model does not consider the sequential characteristic of the dialog, the error may propagate over time that causes an inconsistent dialog (Li et al., 2017; Zhao & Eskenazi, 2016). To address this issue, RL has been applied to the problem (Strub et al., 2017; Das et al., 2017b). By learning the intrinsic planning policy and the reward function, RL approach enables the models to generate a consistent dialog and generalize better on unseen scenarios. 
However, these methods struggle to find a competent RNN model that uses backpropagation, owing to the complexity of learning a series of sentences (Lee et al., 2018).\nAs an alternative, Lee et al. (2018) have recently proposed the Answerer in Questioner\u2019s Mind (AQM) algorithm, which does not depend on the limited capacity of RNN models to cover an entire dialog. AQM treats the problem as a twenty-questions game and selects the question that gives the maximum information gain. Unlike the other approaches, AQM benefits from explicitly calculating the posterior distribution and finding a solution analytically. The authors showed promising results in the task-oriented dialog problem, such as GuessWhat (de Vries et al., 2017), where a questioner tries to find an object that is in the answerer\u2019s mind via a series of Yes/No questions. The candidates are confined to the objects that are presented in the given image (fewer than ten on average). However, this simplified task may not be general enough for practical problems where the numbers of objects, questions, and answers are typically unrestricted. For example, GuessWhich is a generalized version of GuessWhat that has a greater number of class candidates (9,628 images) and a dialog that consists of sentences beyond yes or no (Das et al., 2017b). Because the cost of explicitly calculating the information gain grows with the size of the entire search space, the original AQM algorithm does not scale to large-scale problems. More specifically, the number of unit calculations for information gain in GuessWhat is 10 (number of objects) \u00d7 2 (Yes/No), while that of GuessWhich is 10,000 (number of images) \u00d7 \u221e (answer is a sentence), which makes the computation intractable.\nOne of the interesting ideas Lee et al. (2018) suggested is to retrieve an appropriate question from the training set. 
Retrieval-based models, which are basically discriminative models that select a response from a predefined candidate set of system responses, are often used in task-oriented dialog tasks (Bordes & Weston, 2017; Seo et al., 2017a; Liu & Perez, 2017). It is critical not to generate sentences that are ill-structured or irrelevant to the task. However, such a discriminative approach does not fit well with complicated task-oriented visual dialog tasks, because asking an appropriate question considering the visual context is crucial to successfully tackle the problem. It is notable that AQM achieved high performance even with a retrieval-based approach in GuessWhat by taking the candidate set of questions from the training set. However, Han et al. (2017) pointed out that there exist dominant questions in GuessWhat which can be generally applied to all images (contexts), such as \u201cis it left?\u201d or \u201cis it human?\u201d. GuessWhich is a more complicated task in which such dominant questions are less likely to exist, which is another reason why the original AQM is difficult to apply.\nTo address this, we propose a more generalized version of AQM, dubbed AQM+. Compared to the original AQM, the proposed AQM+ can easily handle the increased number of questions, answers, and candidate classes by employing an approximation based on subset sampling. In particular, unlike AQM, AQM+ generates candidate questions and answers at every turn, and then selects one of the candidate questions to ask. Because our algorithm considers the previous history of the dialog, AQM+ can generate a more contextual question. To understand the practicality and demonstrate the superior performance of our method, we conduct extensive experiments and quantitative analysis on GuessWhich. Experimental results show that our model can successfully deal with sentence-form answers and decreases the error by 61.5%, while the SL and RL methods decrease the error by less than 6%. 
The ablation study shows that our information gain approximation is reasonable. Increasing the number of samples eightfold brought only a marginal improvement of percentile mean rank (PMR) from 94.63% to 94.79%, which indicates that our model can effectively approximate the distribution over the large search space with a small number of samples. Overall, our experimental results provide meaningful insights on how the AQM framework can provide an additional improvement on top of the SL and RL approaches.\nOur main contributions are summarized as follows:\n\u2022 We propose AQM+ that extends the AQM framework toward more general and complicated tasks. AQM+ can handle a more complicated problem where the number of candidate classes is extremely large.\n\u2022 At every turn, AQM+ generates a question considering the context of the previous dialog, which is desirable in practice. In particular, AQM+ generates candidate questions and answers at every turn to ask an appropriate question in the context.\n\u2022 AQM+ outperforms comparative deep learning models by a large margin in GuessWhich, a challenging task-oriented visual dialog task. 3 ALGORITHM: AQM+\n 3.1 PROBLEM SETTING\nIn our experiments, a questioner bot (Qbot) and an answerer bot (Abot) cooperatively communicate to achieve the goal via natural language. Under the AQM framework, at each turn t, Qbot generates an appropriate question qt and guesses the target class c given a previous history of the dialog ht\u22121 = (q1:t\u22121, a1:t\u22121, h0). Here, at is the t-th answer and h0 is an initial context that can be obtained before the start of the dialog. We refer to the random variables of target class and the t-th answer as C and At, respectively. Note that the t-th question is not a random variable in our information gain calculation. To distinguish from the random variables, we use a bold face for a set notation of target class, question, and answers; i.e. 
C, Q, and A.\nFigure 1 explains the AQM+ algorithm applied to the GuessWhich game. In Figure 1, c is the image with three elephants, q1 is \u201cAre there many people?\u201d, a1 is \u201cYes it is.\u201d, q2 is \u201cHow many elephants?\u201d, and h0 is \u201cThere are elephants walking in the zoo.\u201d In the GuessWhich game, C is the set of test images whose size is 9,628. The size of Q and A is theoretically infinite as questions and answers can be more than one word. 3.2 PRELIMINARY: SL, RL, AND AQM APPROACHES\nIn SL and RL approaches (Das et al., 2017b; Jain et al., 2018; Zhang et al., 2018a), Qbot consists of two RNN modules. One is \u201cQgen\u201d, a question generator finding the solution that maximizes its distribution p\u2020; i.e. q\u2217t = argmax_{qt} p\u2020(qt|ht\u22121). The other is \u201cQscore\u201d, a class guesser using a score function for each class f\u2021(c|ht). The two RNN modules can either be two fully separate RNNs (Strub et al., 2017), or share some recurrent layers but have a different output layer for each (Das et al., 2017b).\nFootnote 1: https://github.com/batra-mlp-lab/visdial-rl\nOn the other hand, in the previous AQM approach (Lee et al., 2018), these two RNN-based models are replaced by a calculation that explicitly finds an analytic solution. It finds a question that maximizes information gain or mutual information I\u0303, i.e. q\u2217t = argmax_{qt\u2208Qfix} I\u0303[C,At; qt, ht\u22121], where\nI\u0303[C,At; qt, ht\u22121] = \u2211_{c\u2208C} \u2211_{at\u2208A} p\u0302(c|ht\u22121) p\u0303(at|c, qt, ht\u22121) ln [ p\u0303(at|c, qt, ht\u22121) / p\u0303\u2032(at|qt, ht\u22121) ]. (1)\nHere, the posterior function p\u0302 can be calculated sequentially with the following equation, where p\u0302\u2032 is a prior function given h0.\np\u0302(c|ht) \u221d p\u0302\u2032(c|h0) \u220f_{j=1}^{t} p\u0303(aj|c, qj, hj\u22121) = p\u0302(c|ht\u22121) p\u0303(at|c, qt, ht\u22121) (2)\nIn AQM, Equation 1 and Equation 2 can be explicitly calculated from the model. 
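To make Equations 1 and 2 concrete, here is a minimal sketch over small, fully enumerated sets of classes and answers. It assumes that the denominator in Equation 1 is the posterior-weighted marginal of the answer distribution (mirroring Eq. 6); the function names and the toy numbers are hypothetical and are not part of AQM itself.

```python
import math


def update_posterior(prior, answer_likelihood):
    # Eq. 2: multiply the previous posterior by the likelihood of the
    # observed answer under each class, then renormalize.
    unnorm = [p * l for p, l in zip(prior, answer_likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def information_gain(posterior, answer_probs):
    # answer_probs[c][a] is the approximate answer distribution for class c.
    n_classes = len(posterior)
    n_answers = len(answer_probs[0])
    # Assumed marginal: sum over classes of posterior times answer probability.
    marginal = [
        sum(posterior[c] * answer_probs[c][a] for c in range(n_classes))
        for a in range(n_answers)
    ]
    gain = 0.0
    for c in range(n_classes):
        for a in range(n_answers):
            p_a = answer_probs[c][a]
            if posterior[c] > 0.0 and p_a > 0.0:
                gain += posterior[c] * p_a * math.log(p_a / marginal[a])
    return gain


# Two classes, two answers (yes / no).
informative = information_gain([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
uninformative = information_gain([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
posterior = update_posterior([0.5, 0.5], [0.9, 0.1])
print(informative, uninformative, posterior)
```

A question whose answer separates the two classes perfectly yields ln 2 nats, while a question answered identically for both classes yields zero, so the former would be selected.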
For ease of reference, let us name every component one by one. A module that calculates an information gain I\u0303 is referred to as \u201cQinfo\u201d and a module that finds an approximated answer distribution p\u0303(at|c, qt, ht\u22121) is referred to as \u201caprxAgen\u201d. In AQM, aprxAgen is a model distribution that Qbot has in mind, where the target is the true distribution of an answer generator p\u0304(at|c, qt, ht\u22121), which is referred to as \u201cAgen\u201d. Finally, \u201cQpost\u201d denotes a posterior p\u0302 calculation module for guessing a target class.\nAs AQM uses the full sets of C and A, the complexity depends on the sizes of C and A. For the question selection, AQM uses a predefined set of candidate questions (Qfix), which does not change across turns. 3.3 AQM+ ALGORITHM\nIn this paper, we propose the AQM+ algorithm, which uses a sampling-based approximation, for tackling the large-scale task-oriented dialog problem. The core differences of AQM+ from the previous AQM are summarized as follows:\n\u2022 The candidate question set Qt,gen is sampled from p\u2020(qt|ht\u22121) using a beam search at every turn. Previously, Lee et al. (2018) used a predefined set of candidate questions Qfix. For example, one way to obtain Qfix is to select questions from the training dataset randomly, called \u201crandQ\u201d.\n\u2022 The answerer model (aprxAgen, p\u0303) that Qbot has in mind is not a binary classifier (yes/no) but an RNN generator. In addition, aprxAgen does not assume p\u0303(at|c, qt) = p\u0303(at|c, qt, ht\u22121), which is not an appropriate assumption when the previous and current questions are sequentially related. For example, p(a2 = \u201cyes\u201d | c, q2 = \u201cis left?\u201d) \u2260 p(a2 = \u201cyes\u201d | c, q2 = \u201cis left?\u201d, a1 = \u201cyes\u201d, q1 = \u201cis right?\u201d). 
Regardless of the left term, the probability of the right term is almost zero.\n\u2022 To approximate the information gain of each question, the subsets of A and C are also sampled at every turn. The previous algorithm used the full sets of A and C. We give an additional explanation of our information gain approximation, infogain topk, below.\nInfogain topk The equation for Infogain topk is as follows:\n$$\\tilde{I}_{\\mathrm{topk}}[C, A_t; q_t, h_{t-1}] = \\sum_{a_t \\in A_{t,\\mathrm{topk}}(q_t)} \\sum_{c \\in C_{t,\\mathrm{topk}}} \\hat{p}_{\\mathrm{reg}}(c|h_{t-1}) \\tilde{p}_{\\mathrm{reg}}(a_t|c, q_t, h_{t-1}) \\ln \\frac{\\tilde{p}_{\\mathrm{reg}}(a_t|c, q_t, h_{t-1})}{\\tilde{p}'_{\\mathrm{reg}}(a_t|q_t, h_{t-1})}, \\quad (3)$$\nwhere p\u0302reg and p\u0303reg are normalized versions of p\u0302 over Ct,topk and p\u0303 over At,topk(qt), respectively. Here, p\u0303\u2032reg is obtained by using both p\u0302reg and p\u0303reg as follows:\n$$\\hat{p}_{\\mathrm{reg}}(c|h_{t-1}) = \\frac{\\hat{p}(c|h_{t-1})}{\\sum_{c \\in C_{t,\\mathrm{topk}}} \\hat{p}(c|h_{t-1})} \\quad (4)$$\n$$\\tilde{p}_{\\mathrm{reg}}(a_t|c, q_t, h_{t-1}) = \\frac{\\tilde{p}(a_t|c, q_t, h_{t-1})}{\\sum_{a_t \\in A_{t,\\mathrm{topk}}(q_t)} \\tilde{p}(a_t|c, q_t, h_{t-1})} \\quad (5)$$\n$$\\tilde{p}'_{\\mathrm{reg}}(a_t|q_t, h_{t-1}) = \\sum_{c \\in C_{t,\\mathrm{topk}}} \\hat{p}_{\\mathrm{reg}}(c|h_{t-1}) \\cdot \\tilde{p}_{\\mathrm{reg}}(a_t|c, q_t, h_{t-1}) \\quad (6)$$\nEach set is constructed by the following procedures.\n\u2022 Ct,topk \u2190 top-K posterior test images (from Qpost p\u0302(c|ht\u22121))\n\u2022 Qt,gen \u2190 top-K likelihood questions using the beam search (from Qgen p\u2020(qt|ht\u22121))\n\u2022 At,topk(qt) \u2190 top-1 generated answers from aprxAgen for each question qt and each class in Ct,topk (from aprxAgen p\u0303(at|c, qt, ht\u22121))\nTop-K samples may lead our approximation to be biased toward plausible (high-probability) candidate classes and plausible candidate answers. 
However, we chose to use top-K samples because our main goal is to reduce the entropy over plausible candidate classes and answers, not over the whole set of candidate classes and answers.\nIn general, the AQM+ algorithm can deal with various problems where |Ct,topk|, |Qt,gen|, and |At,topk(qt)| are all different. Here, | \u00b7 | denotes the cardinality of a set. We can vary the size of each set and control the complexity of the AQM+ algorithm. In our experiments, however, we mainly considered the case where |Ct,topk| = |Qt,gen| = |At,topk(qt)|. More specifically, |Ct,topk| is equal to |At,topk(qt)| because, for a given pair (qt, c), our model finds the single best answer at that maximizes p\u0303(at|c, qt, ht\u22121). Therefore, |At,topk| = |Qt,gen| \u00b7 |Ct,topk| per information gain calculation, where At,topk = {At,topk(qt)|qt \u2208 Qt,gen}. For the detailed explanation, see Algorithm 1 in Appendix A.\nWe also explain the extended sampling method on candidate answers for cases where |A| \u2260 |C| is required. In the extended method, aprxAgen first generates top-m answers for each candidate question and each candidate class, where m is the smallest integer satisfying |A| \u2264 |C| \u00b7 m. After that, the candidate answers are randomly removed, leaving only |A| answers. 3.4 LEARNING\nIn all SL, RL, and AQM frameworks, Qbot needs to be trained to approximate the answer-generating probability distribution of Abot. In the AQM approach, aprxAgen does not share parameters with Agen, and therefore also needs to be trained to approximate Agen. AQM can train aprxAgen by the learning strategy of the SL or RL approach. We explain two learning strategies of the AQM framework below: indA and depA. In the SL approach, Qgen and Qscore are trained from the training data, which have the same or a similar distribution to that of the training data used in training Abot. Likewise, in the indA setting of the AQM approach, aprxAgen is trained from the training data. 
In the RL approach, Qbot uses dialogs made by the conversation of Qbot and Abot and the result of the game as the objective function (i.e. reward). Likewise, in the depA setting of the AQM approach, aprxAgen is trained from the questions in the training data and the following answers obtained in the conversation between Qbot and Abot. We also use the term trueA, referring to the setting where aprxAgen is the same as Agen, i.e. they share the same parameters. Both the previous AQM algorithm and the proposed AQM+ algorithm use these learning strategies. 4 EXPERIMENTS\n 4.1 EXPERIMENTAL SETTING\nGuessWhich Task GuessWhich is a two-player game played by Qbot and Abot. The goal of GuessWhich is to identify the correct image among 9,628 test images by asking a sequence of questions. Abot can see the randomly assigned target image, which is unknown to Qbot. Qbot only observes a caption of the image generated from Neuraltalk2 (Vinyals & Le, 2015). To achieve the goal, Qbot asks a series of questions, to which Abot responds with a sentence.\nComparative Models We compare AQM+ with three comparative models, SL-Q, RL-Q, and RL-QA (Das et al., 2017b). In SL-Q, Qbot and Abot are trained separately from the training data. In RL-Q, Qbot is initialized by the Qbot trained by SL-Q and then is fine-tuned by RL. Abot is the same as the Abot trained by SL-Q, and is not fine-tuned further. In the original paper (Das et al., 2017b), it was referred to as Frozen-A. In contrast, in the RL-QA setting, Abot is also fine-tuned concurrently with Qbot. In the original paper, it was referred to as RL-full-QAf. We also compare our AQM+ with the \u201cGuesser\u201d algorithm. Guesser asks a question generated from the SL-Q algorithm and calculates the posterior by Qpost of AQM+.\nNon-delta vs. Delta Hyperparameter Setting An important issue in our GuessWhich experiment is the delta setting. In the paper of Das et al. (2017b), the SL-Q, RL-Q, and RL-QA algorithms achieve moderate increases in performance. 
In SL-Q, the percentile mean rank (PMR) improves from 88.5% to 90.9%. In RL-QA, the PMR improves from 90.6% to 93.3%. Here, a PMR of 93.3% at the zeroth turn means that the model predicts the correct image to be more likely than 8,983 other images out of 9,628 candidates using the caption information alone. However, Das et al. (2017b) found that another hyperparameter setting, delta, makes much progress on their algorithm. The delta setting refers to different weights on the loss and a different learning decay rate. Based on the authors\u2019 recent report on GitHub, the SL-Q and RL-QA methods reduce the error by less than 6% through the dialog compared to the zeroth-turn baseline, which uses only the generated caption. The PMR of the target (class) image using only the caption is around 95.5%, but the dialog does not improve the PMR to more than 95.8%. We use both the non-delta setting (the setting in the original paper) and the delta setting (the setting on GitHub) to test the performance of AQM+.\nOther Experimental Setting As shown in Figure 2, our model uses five modules, Qgen, Qscore, aprxAgen, Qinfo, and Qpost. We use the same Qgen and Qscore modules as the comparative SL-Q model. In Visual Dialog, Qgen and Qscore share one RNN structure and have different output layers for each. The prior function is obtained from p\u0302\u2032(c|h0) \u221d exp(\u03bb \u00b7 f\u2021(c|h0)) using Qscore, where \u03bb is a balancing hyperparameter between prior and likelihood. We set |Ct,topk| = |Qt,gen| = |At,topk(qt)| = 20. SL-Q is trained for 60 epochs. RL-Q and RL-QA are trained for 20 epochs in the non-delta setting and 15 epochs in the delta setting. Our code is modified from the code of Modhe et al. (2018), and we make our code publicly available2. All experiments are implemented and fine-tuned with the NAVER Smart Machine Learning (NSML) platform (Sung et al., 2017; Kim et al., 2018). 
4.2 COMPARATIVE RESULTS\nFigure 3 shows the PMR of the target image for our AQM+ and comparative models across the rounds. Figure 3a corresponds to the non-delta setting in the original paper (Das et al., 2017b) and Figure 3b corresponds to the delta setting proposed in the GitHub code.\nWe see that SL-Q and RL-QA do not significantly improve the performance after a few rounds, especially for the delta setting. In the delta setting, SL-Q increases its performance from 95.45% to 95.72% at the 10th round, and RL-QA increases its performance from 95.44% to 95.69%. This means that the error drop of the SL-Q and RL-QA algorithms is 5.74% and 5.33%, respectively. On the other hand, AQM-indA increases its PMR from 95.45% to 96.53% at the fifth round and reaches 97.17% at the 10th round. Likewise, AQM-depA increases its PMR from 95.45% to 97.48% at the fifth round and reaches 98.25% at the 10th round, decreasing the error by 61.5%. Note that Guesser w/ indA achieves 96.37% at the 10th round, outperforming SL-Q by a significant margin. This shows that not only the question generation but also the guessing mechanism contributes to the performance degradation of SL and RL algorithms. 4.3 ABLATION STUDY\nNo Caption Experiment We test our AQM+ algorithm where no caption information exists. For the zeroth prediction, we simply replace the prior function from Qscore with a uniform function. Since Qgen in either SL-Q or RL-QA is also trained assuming the existence of the caption, we tried two alternative settings to approximate experiments without a caption. The first trial is the zero-caption experiment, where the caption vector is filled with zeros. The second trial is the random-caption experiment, where the caption vector is replaced with a random caption vector, which is not related to the target image. Figure 4a shows that AQM+ performs well for both the zero-caption and random-caption settings. By contrast, SL-Q and RL-QA do not work at all. 
It seems SL-Q and RL-QA are not trained on the situation where a zero-caption vector or even a totally wrong caption vector is given. Though training SL-Q and RL-QA for these situations can increase their performance, it is evident that SL and RL algorithms are not robust to unexpected environments. Likewise, we also run no caption experiments for the depA setting. For more ablation studies, see Figure 7 in Appendix B.\n2 https://github.com/naver/aqm-plus\nRandom Candidate Answers Experiment One of our main arguments is that generating candidate questions from Qscore and candidate answers from aprxAgen at every turn makes AQM+ effectively deal with general and complicated task-oriented dialogs. To support the argument, we conducted the experiments under the setting where the answer set is randomly selected from the training data and then fixed. Random selection of candidate answers decreases the performance from 94.64% to 92.78% for indA, non-delta, at the 10th round. Appendix B also includes a discussion on the setting with a predefined candidate question set Qfix.\nNumber of QAC Experiment We also changed the size of subset K = |Qt,gen|=|At,topk(qt)|=|Ct,topk| to check the efficiency of our information gain approximation, using the non-delta setting. Figure 5a shows the experimental results. Note that AQM+ with the setting of K = 1 corresponds to Guesser. In the setting of non-delta and indA, 94.64% of PMR is achieved when K is 20, whereas 94.79% is achieved when K is 40. Note that an 8-fold (2 \u00d7 2 \u00d7 2) increase in complexity improves the PMR by only 0.15%, showing the efficiency of the setting of K=20 in our experiments. On the other hand, this result also implies that increasing K would further improve the performance. Likewise, in the depA setting, changing K from 20 to 40 increases the PMR from 97.44% to 97.77%. For more ablation studies, see Figure 8 in Appendix B. We also changed the size of each subset, |Qt,gen|, |At,topk(qt)|, and |Ct,topk|. Figures 5b-d show the results. 
|Qt,gen| has the most effect, whereas |At,topk(qt)| has the least effect. Generated Questions and Selected Images Figure 6 shows the top-k images selected by AQM+\u2019s posterior. Non-delta and indA setting is used. The figure shows that relevant images to the caption remained after few dialog turns. The bottom number in the image denotes posterior of the image AQM+ thinks of. We also compare selected examples of generated dialog of SL-Q, RL-QA, and AQM+ w/ indA for delta setting. See Figure 10 in Appendix C for the results. 5 DISCUSSION\n 5.1 DIFFICULTY OF GUESSWHICH\nAccording to our results, we infer that PMR degradation of comparative SL and RL models during the dialog is not caused by forgetting dialog context to ask an appropriate question. Comparative results between AQM+ and Guesser show that the improvement from AQM+\u2019s Qpost is significant, which implies that the major constraint of SL and RL is the limited capacity of RNN and its softmax score function.\nAnother reason for the poor performance lies in the current status of VQA models. According to Das et al. (2017a), they discovered a variety of models, one of which is used in both the study of Das et al. (2017b) and our experiments, and can already reach 41.2% for answer retrieval accuracy from 100 candidate answers, solely using the question without exploiting image and history information. Fully exploiting these factors, however, increases the performance only slightly to 45.5%. As discrimination on different images relies on image and history information, Qbot suffers to gain meaningful information through the dialog. Therefore, applying AQM+ to the GuessWhich problem means that we not only solve a very complicated problem, but also find that the AQM framework is applicable to the situation where the answer has high uncertainty. 5.2 NOTES ON COMPARATIVE ANALYSIS\nFine-tuning both Qbot and Abot through RL Though RL-QA is the main setting in the work of Das et al. 
(2017b), there are some reports indicating that fine-tuning both Qbot and Abot is unfair (de Vries et al., 2017; Han et al., 2017), as one of the ultimate goals in this field is to make a questioner able to talk with humans. If the distribution of Abot is not fixed during RL, Qbot and Abot can make their own language, which is not compatible with natural language (Kottur et al., 2017). To prevent this problem, many studies added the objective function of a language model during RL (Zhu et al., 2017; Das et al., 2017b). However, even though the generated dialog is tuned to be like human dialog, the performance of RL-QA in conversation with humans would decrease compared to SL-Q, because the distribution of Abot becomes far from the human\u2019s (Chattopadhyay et al., 2017; Lee et al., 2018). Moreover, achieving good performance by fine-tuning both Qbot and Abot is much easier than fine-tuning only Qbot (Zhu et al., 2017; Han et al., 2017). Thus, it is reasonable to compare AQM+ w/ indA and AQM+ w/ depA with SL-Q and RL-Q, respectively.\nComputational Cost AQM+ at K=20 uses 20\u00d720\u00d720 calculations for information gain. On the other hand, the previous AQM requires 20\u00d7\u221e\u00d79628 calculations for information gain, which makes the computation intractable. Even if we use only the 100 candidate answers provided in the Visual Dialog dataset (Das et al., 2017a), the previous AQM requires 2500 times as many calculations (20M) as AQM+. On the other hand, AQM+ requires more calculations and thus more inference time than SL or RL. AQM+ generates one question within around 3s when K=20, whereas SL generates one question within 0.1s. We used a Tesla P40 for our experiments. Though the complexity of our information gain is O(K^3), the time required for the whole inference does not increase in proportion to the cube of K when K=20. This is because calculating the information gain is not the sole resource-intensive part of the whole inference process. 
5.3 TOWARD PRACTICAL APPLICATIONS\nThere are plenty of potential future works to improve the performance of AQM+ in real task-oriented dialog applications. For example, robust task-oriented dialog systems are required for appropriately replying to user\u2019s questions (Li et al., 2017) and responding for chit-chat style conversation (Zhao et al., 2017). The question quality can also be improved by diverse beam search approaches (Vijayakumar et al., 2016; Li et al., 2016), which prevent sampling similar questions for the candidate set. We highlight two issues described below; online learning and fast inference.\nOnline Learning For a novel answerer, fine-tuning on the dialog model is required (Krause et al., 2018). If the experiences of many users are available, model-agnostic meta learning (MAML) (Finn et al., 2017) can be applied for few-shot learning. Updating the hyperparameter \u03bb in an online manner, which balances the effect of the prior and the likelihood, can also be effective in practice. If the answer distribution of user is different from our aprxAgen, we can increase \u03bb to decrease the effect of the likelihood.\nFast Inference AQM+\u2019s time complexity can be decreased further by changing the structure of aprxAgen. In specific, we can apply diverse methods such as skipping the update of hidden states in some steps (Seo et al., 2018), using convolution networks or self-attention networks (Yu et al., 2018; Vaswani et al., 2017), substituting matrix multiplication operation for hidden state update to weighted addition (Yu & Liu, 2018), and direct information gain inference from the neural networks (Belghazi et al., 2018). 6 CONCLUSION\nAsking appropriate questions in practical applications has recently been paid attention (Rao & Daume\u0301 III, 2018; Buck et al., 2018). We proposed AQM+ algorithm that is a large-scale extension of AQM framework. 
AQM+ can ask an appropriate question considering the context of the dialog, handle the responses in a sentence form, and efficiently estimate information gain of the target class with a given question. This improvement moves our AQM framework a step toward practical task-oriented applications. AQM+ not only outperforms the comparative SL and RL algorithms, but also enlarges the gap between AQM+ and the comparative algorithms compared to the performance gaps reported in GuessWhat. AQM+ achieves an error decrease of more than 60% through the dialog, whereas the comparative algorithms only achieve a 6% error decrease. Moreover, the performance of AQM+ can be boosted further by employing the models recently proposed in the visual dialog field such as other question generator models (Jain et al., 2018) and question answering models (Kottur et al., 2018). ACKNOWLEDGMENTS\nThe authors would like to thank Yu-Jung Heo, Hwiyeol Jo, and Kyunghyun Cho for helpful comments. This work was supported by the Creative Industrial Technology Development Program (10053249) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea). APPENDIX A. AQM+ ALGORITHM\nThe question generating process of AQM+ used in our GuessWhich experiments is as follows.\nAlgorithm 1 Question Generating Process of AQM+ in Our GuessWhich Experiments\np\u0302\u2032(c|h0) \u221d exp(\u03bb \u00b7 f\u2021(c|h0))\nfor t = 1:T do\nCt,topk \u2190 top-K posterior test images (from Qpost p\u0302(c|ht\u22121))\nQt,gen \u2190 top-K likelihood questions using beam search (from Qgen p\u2020(qt|ht\u22121))\nAt,topk(qt) \u2190 generated answers from aprxAgen for question qt and each class in Ct,topk (from aprxAgen p\u0303(at|c, qt, ht\u22121))\nqt \u2190 argmax_{q\u2032t\u2208Qt,gen} I\u0303[C,At; q\u2032t, a1:t\u22121, q1:t\u22121] with At,topk(qt) and Ct,topk in Eq. 1\nGet at from Agen p\u0304(at|c, qt, ht\u22121)\nUpdate Qpost p\u0302(c|ht) \u221d p\u0303(at|c, qt, ht\u22121) \u00b7 p\u0302(c|ht\u22121) in Eq. 
2\nend for APPENDIX B. ABLATION STUDY\nFigure 7 shows the results of the number of QAC ablation experiment on depA and trueA, in the non-delta setting. The effect of K decreases in trueA compared to indA, which indicates that the similarity between the distribution of aprxAgen and Agen is related to the effectiveness of large K. Figure 8 shows the results of the no caption experiment on depA and trueA, in the non-delta setting.\nFigure 9 shows the experimental results on the model where AQM+\u2019s Qinfo is used as the question generator and SL\u2019s Qscore is used as the guesser. AQM+\u2019s Qinfo does not improve the performance of SL\u2019s guesser (Qscore). Our analysis of the results is as follows. For the delta setting, the SL guesser is not able to obtain information from the answers. For the non-delta case, it is the caption rather than the dialog history that gives the dominant information to SL\u2019s guesser. The questions which often appear with the caption thus give a clearer signal of the target class to SL\u2019s guesser. Figure 9a shows that SL-Q performs better than RL-Q in the early phase, but SL-Q\u2019s performance decreases faster than that of RL-Q in the later phase. This is because SL-Q generates questions that are more likely to have co-appeared with the caption than RL-Q does. Likewise, AQM+\u2019s question does not help SL\u2019s guesser because AQM+ generates questions that are more independent of the caption.\nWe conducted the experiments under the setting where a predefined candidate question set Qfix is used. The discussion section in the work of Lee et al. (2018) includes an experimental setting in which the candidate questions are generated from an end-to-end SL model only at the first turn. We refer to this setting as gen1Q, as in the previous AQM paper. Figure 10 shows the results of the gen1Q ablation study. Note that this setting of |Q|=100 requires five times as many computations to calculate the information gain as the original AQM+, even though gen1Q performs even worse than the Guesser baseline. 
Another noticeable phenomenon is that there is no significant performance loss in the trueA setting. Since aprxAgen in trueA knows the exact probability of Abot\u2019s answer, by exploiting such an aprxAgen, Qbot in trueA can clearly distinguish between different classes by capturing even the subtle differences in answer distributions given similar questions. We also performed the experiments under the setting where Qfix comes from training data. Figure 11 shows the results of the randQ ablation study. The baseline method with this Qfix showed accuracy degradation. Regardless of the PMR, we point out that randQ retrieves questions relevant to neither the caption nor the target image. This is why we generate candidate questions from a seq-to-seq model.\nFigure 12 shows the results of the no history experiment. Dialog history helps to guess the target image but is not critical. Ablating history makes the performance decrease by 0.22% and 0.56% for indA and depA in non-delta, respectively, and 0.46% and 0.21% for indA and depA in delta, respectively. APPENDIX C. GENERATING SENTENCES\nFigure 13 shows selected examples of generated questions in the delta setting. Though the delta setting greatly increases the PMR of the zeroth turn, it degrades the question quality, especially for RL-QA. Moreover, RL-QA tends to concentrate on the first turn, leaving questions and answers of the remaining turns meaningless.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Learning robust rewards with adversarial inverse reinforcement learning , Authors: Justin Fu Katie Luo... , 2017\n2.  Title: Continuous deep q-learning with model-based acceleration , Authors: Shixiang Gu Timothy P. Lillicrap... , 2016\n3.  Title: Adversary a3c for robust reinforcement learning, 2018 , Authors: Zhaoyuan Gu Zhenzhong Jia... , 2018\n4.  Title: Adversarial attacks on neural network policies , Authors: Sandy Huang Nicolas Papernot... , 2017\n5.  Title: Deep reinforcement learning doesn\u2019t work yet , Authors: Alex Irpan , 2018\n6.  Title: Reinforcement learning in robust markov decision processes , Authors: Shiau Hong Lim Huan Xu... , 2016\n7.  Title: Tactics of adversarial attack on deep reinforcement learning agents , Authors: Yen-Chen Lin Zhang-Wei Hong... , 2017\n8.  Title: Learning from corrupted binary labels via class-probability estimation , Authors: Aditya Menon Brendan Van Rooyen... , 2015\n9.  Title: Learning with noisy labels , Authors: Nagarajan Natarajan Inderjit S Dhillon... , 2013\n10.  Title: Robust adversarial reinforcement learning , Authors: Lerrel Pinto James Davidson... , 2017\n11.  Title: Epopt: Learning robust neural network policies using model ensembles , Authors: Aravind Rajeswaran Sarvjeet Ghotra... , 2016\n12.  Title: Reinforcement learning under model mismatch , Authors: Aurko Roy Huan Xu... , 2017\n13.  Title: A rate of convergence for mixture proportion estimation, with application to learning from noisy labels , Authors: Clayton Scott , 2015\n14.  Title: Classification with asymmetric label noise: Consistency and maximal denoising , Authors: Clayton Scott Gilles Blanchard... , 2013\n15.  Title: Distral: Robust multitask reinforcement learning , Authors: Yee Whye Teh Victor Bapst... 
, 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Recent studies have shown that reinforcement learning (RL) models can be vulnerable in various scenarios, where noises from different sources could appear. For instance, the observed reward channel is often subject to noise in practice (e.g., when observed rewards are collected through sensors), and thus observed rewards may not be credible. Also, in applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors. In this paper, we consider noisy RL problems where observed rewards by RL agents are generated with a reward confusion matrix. We call such observed rewards as perturbed rewards. We develop an unbiased reward estimator aided robust RL framework that enables RL agents to learn in noisy environments while observing only perturbed rewards. Our framework draws upon approaches for supervised learning with noisy data. The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 67.5% and 46.7% improvements in average on five Atari games, when the error rates are 10% and 30% respectively.", "Robust Reinforcement Learning It is known that RL algorithms are vulnerable to noisy environments (Irpan, 2018). Recent studies (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017) show that learned RL policies can be easily misled with small perturbations in observations. The presence of noise is very common in real-world environments, especially in robotics-relevant applications. 
Consequently, robust (adversarial) reinforcement learning (RRL/RARL) algorithms have been widely studied, aiming to train a robust policy which is capable of withstanding perturbed observations (Teh et al., 2017; Pinto et al., 2017; Gu et al., 2018) or transferring to unseen environments (Rajeswaran et al., 2016; Fu et al., 2017). However, these robust RL algorithms mainly focus on noisy vision observations, instead of the observed rewards. A couple of recent works (Lim et al., 2016; Roy et al., 2017) have also looked into a rather parallel question of training robust RL algorithms with uncertainty in models. Learning with Noisy Data Learning appropriately with biased data has received quite a bit of attention in recent machine learning studies Natarajan et al. (2013); Scott et al. (2013); Scott (2015); Sukhbaatar & Fergus (2014); van Rooyen & Williamson (2015); Menon et al. (2015). The idea of above line of works is to define unbiased surrogate loss function to recover the true loss using the knowledge of the noises. We adapt these approaches to reinforcement learning. Though intuitively the idea should apply in our RL settings, our work is the first one to formally establish this extension both theoretically and empirically. Our quantitative understandings will provide practical insights when implementing reinforcement learning algorithms in noisy environments.", "the paper focuses on an interesting and relatively unexplored direction for RL. Apart from the work cited by the paper (and perhaps work like Krueger et al. (2016), in which agents must pay some cost to observe true rewards), there is little work on learning settings of this kind. This paper represents a first step in gaining clarity on how to formalize and study this problem. 
I did, however, find the analysis and the experiments to be relatively disjointed -- the main sample complexity result presented by the paper (Theorem 2) was given for Phased Q-Learning, yet no experiments actually evaluate the performance of Phased Q-Learning. I think the paper could benefit from experiments focused on simple domains that showcase how traditional algorithms do in cases where it is easier to understand (and visualize) the impact of the reward perturbations (simple chain MDPs, grid worlds, etc.) -- and specifically experiments including Phased Q-Learning. \n\nSection 1) Introduction:\n\t- \"adversaries in real-world\" --> \"adversaries in the real-world\"\n\t- You might consider citing Loftin et al. (2014) regarding the bulleted point about \"Application-Specific Noise\".\n\t- \"unbiased reward estimator aided reward robust reinforcement learning framework\" --> this was a bit hard to parse. Consider making more concise, like: \"unbiased reward estimator for use in reinforcement learning with perturbed rewards\".\n\t- \"Our solution framework builds on existing reinforcement learning algorithms, including the recently developed DRL ones\" --> cite these up front. So, cite: Q-Learning, CEM, SARSA, DQN, Dueling DQN, DDPG, NAF, and PPO, and spell out the acronym for each the first time you introduce them.\n\t- \"layer of explorations\" --> \"layer of exploration\"\n\nSection 2) Problem Formulation:\n\t- \"as each shot of our\" --> what is 'shot' in this context?\n\t- \"In what follow,\" --> \"In what follows,\"\n\t- \"where 0 < \\gamma \\leq 1\" --> Usually, $\\gamma \\in [0,1)$, or $[0,1]$. Why can't $\\gamma = 0$?\n\t- The transition notation changes between $\\mathbb{P}_a(s_{t+1} | s_t)$ and $\\mathbb{P}(s_{t+1} | s_t, a_t)$. I'd suggest picking one and sticking with it to improve clarity.\n\t- \"to learn a state-action value function, for example the Q-function\" --> Why is the Q-function just an example? Isn't it *the* state-action value function? 
That is, I'd suggest replacing \"to learn a state-action value function, for example the Q-function\" with \"to learn a state-action value function, also called the Q-function\"\n\t- \"Q-function calculates\" --> \"The Q-function denotes\"\n\t- \"the reward feedbacks perfectly\" --> \"the reward feedback perfectly\"\n\t- I prefer that the exposition of the perturbed reward MDP be done with C in the tuple. So: $\\tilde{M} = \\langle \\mathcal{S}, \\mathcal{A}, \\mathcal{R}, C, \\mathcal{P}, \\gamma \\rangle$. This seems the most appropriate definition, since the observed rewards will be generated by $C$.\n\t- The setup of the confusion matrix for reward noise over is very clean. It might be worth pointing out that $C$ need not be Markovian. There are cases where C is not just a function of $\\mathcal{S}$ and $\\mathcal{R}$, like the adversarial case you describe early on.\n\n\n4) Experiments:\n\t- Diverse experiments! That's great. Lots of algorithms, lots of environment types.\n\t- I expected to see Phrased Q-Learning in the experiments. Why was it not included?\n\t- The plots are pretty jagged, so I'm left feeling a bit skeptical about some of the results. The results would be strengthened if the experiments were repeated for more trials.\n\n6: Marginally above acceptance threshold"], "input": "1 INTRODUCTION\nDesigning a suitable reward function plays a critical role in building reinforcement learning models for real-world applications. Ideally, one would want to customize reward functions to achieve application-specific goals (Hadfield-Menell et al., 2017). In practice, however, it is difficult to design a function that produces credible rewards in the presence of noise. This is because the output from any reward function is subject to multiple kinds of randomness:\n\u2022 Inherent Noise. 
For instance, sensors on a robot will be affected by physical conditions such as temperature and lighting, and therefore will report back noisy observed rewards.\n\u2022 Application-Specific Noise. In machine teaching tasks (Thomaz et al., 2006; Loftin et al., 2014), when an RL agent receives feedback/instructions from people, different human instructors might provide drastically different feedback due to their personal styles and capabilities. This way the RL agent (machine) will obtain reward with bias.\n\u2022 Adversarial Noise. Adversarial perturbation has been widely explored in different learning tasks and shows strong attack power against different machine learning models. For instance, Huang et al. (2017) has shown that by adding adversarial perturbation to each frame of the game, they can mislead RL policies arbitrarily.\nAssuming an arbitrary noise model makes solving this noisy RL problem extremely challenging. Instead, we focus on a specific noisy reward model which we call perturbed rewards, where the observed rewards by RL agents are generated according to a reward confusion matrix. This is not a very restrictive setting to start with, even considering that the noise could be adversarial: Given that arbitrary pixel value manipulation attack in RL is not very practical, adversaries in the real-world have high incentives to inject adversarial perturbation to the reward value by slightly modifying it. For instance, adversaries can manipulate sensors via reversing the reward value.\nIn this paper, we develop an unbiased reward estimator aided robust framework that enables an RL agent to learn in a noisy environment with observing only perturbed rewards. 
Our solution framework builds on existing reinforcement learning algorithms, including the recently developed DRL ones (Q-Learning (Watkins, 1989; Watkins & Dayan, 1992), Cross-Entropy Method (CEM) (Szita & Lo\u0308rincz, 2006), Deep SARSA (Sutton & Barto, 1998), Deep Q-Network (DQN) (Mnih et al., 2013; 2015; van Hasselt et al., 2016), Dueling DQN (DDQN) (Wang et al., 2016), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Continuous DQN (NAF) (Gu et al., 2016) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithms).\nThe main challenge is that the observed rewards are likely to be biased, and in RL or DRL the accumulated errors could amplify the reward estimation error over time. We do not require any assumption on knowing the true distribution of reward or adversarial strategies, other than the fact that the generation of noises follow an unknown reward confusion matrix. Instead, we address the issue of estimating the reward confusion matrices by proposing an efficient and flexible estimation module. Everitt et al. (2017) provided preliminary studies for the noisy reward problem and gave some general negative results. The authors proved a No Free Lunch theorem, which is, without any assumption about what the reward corruption is, all agents can be misled. Our results do not contradict with the results therein, as we consider a specific noise generation model (that leads to a set of perturbed rewards). We analyze the convergence and sample complexity for the policy trained based on our proposed method using surrogate rewards in RL, using Q-Learning as an example.\nWe conduct extensive experiments on OpenAI Gym (Brockman et al., 2016) (AirRaid, Alien, Carnival, MsPacman, Pong, Phoenix, Seaquest) and show that the proposed reward robust RL method achieves comparable performance with the policy trained using the true rewards. 
In some cases, our method even achieves higher cumulative reward - this is surprising to us at first, but we conjecture that the inserted noise together with our noisy-removal unbiased estimator adds another layer of exploration, which proves to be beneficial in some settings. This merits a future study.\nOur contributions are summarized as follows: (1) We adapt and generalize the idea of defining a simple but effective unbiased estimator for true rewards using observed and perturbed rewards to the reinforcement learning setting. The proposed estimator helps guarantee the convergence to the optimal policy even when the RL agents only have noisy observations of the rewards. (2) We analyze the convergence to the optimal policy and finite sample complexity of our reward robust RL methods, using Q-Learning as the running example. (3) Extensive experiments on OpenAI Gym show that our proposed algorithms perform robustly even at high noise rates. 2 PROBLEM FORMULATION AND PRELIMINARIES\nIn this section, we define our problem of learning from perturbed rewards in reinforcement learning. Throughout this paper, we will use perturbed reward and noisy reward interchangeably, as each time step of our sequential decision making setting is similar to the \u201clearning with noisy data\u201d setting in supervised learning (Natarajan et al., 2013; Scott et al., 2013; Scott, 2015; Sukhbaatar & Fergus, 2014). In what follows, we formulate our Markov Decision Process (MDP) problem and the reinforcement learning (RL) problem with perturbed (noisy) rewards. 2.1 REINFORCEMENT LEARNING: THE NOISE-FREE SETTING\nOur RL agent interacts with an unknown environment and attempts to maximize the total of his collected reward. The environment is formalized as a Markov Decision Process (MDP), denoting as M = \u3008S,A,R,P, \u03b3\u3009. 
At each time t, the agent in state st \u2208 S takes an action at \u2208 A, which returns a reward r(st, at, st+1) \u2208 R (which we will also shorthand as rt), and leads to the next state st+1 \u2208 S according to a transition probability kernel P, which encodes the probability Pa(st, st+1). Commonly P is unknown to the agent. The agent\u2019s goal is to learn the optimal policy, a conditional distribution \u03c0(a|s) that maximizes the state\u2019s value function. The value function gives the cumulative reward the agent expects to receive if he follows the current policy \u03c0 after observing the current state st: V \u03c0(s) = E\u03c0 [ \u2211_{k=1}^{\u221e} \u03b3^k rt+k | st = s ], where 0 \u2264 \u03b3 \u2264 1 (see footnote 1) is a discount factor. Intuitively, the agent evaluates how preferable each state is given the current policy. From the Bellman Equation, the optimal value function is given by V \u2217(s) = max_{a\u2208A} \u2211_{st+1\u2208S} Pa(st, st+1) [rt + \u03b3V \u2217(st+1)]. It is a standard practice for RL algorithms to learn a state-action value function, also called the Q-function. The Q-function denotes the expected cumulative reward if the agent chooses a in the current state and follows \u03c0 thereafter: Q\u03c0(s, a) = E\u03c0 [r(st, at, st+1) + \u03b3V \u03c0(st+1) | st = s, at = a]. 2.2 PERTURBED REWARD IN RL\nIn many practical settings, our RL agent does not observe the reward feedback perfectly. We consider the following MDP with perturbed reward, denoted as M\u0303 = \u3008S,A,R, C,P, \u03b3\u3009 (see footnote 2): instead of observing rt \u2208 R at each time t directly (following his action), our RL agent only observes a perturbed version of rt, denoted as r\u0303t \u2208 R\u0303. For most of our presentation, we focus on the cases where R, R\u0303 are finite sets; but our results generalize to the continuous reward settings.\nThe generation of r\u0303 follows a certain function C : S\u00d7R \u2192 R\u0303. 
To let our presentation stay focused, we consider the following simple state-independent (see footnote 3) flipping error rates model: if the rewards are binary (consider r+ and r\u2212), r\u0303(st, at, st+1) (r\u0303t) can be characterized by the following noise rate parameters e+, e\u2212: e+ = P(r\u0303(st, at, st+1) = r\u2212|r(st, at, st+1) = r+), e\u2212 = P(r\u0303(st, at, st+1) = r+|r(st, at, st+1) = r\u2212). When the signal levels are beyond binary, suppose there are M outcomes in total, denoted as [R0, R1, \u00b7 \u00b7 \u00b7 , RM\u22121]. r\u0303t will be generated according to the following confusion matrix CM\u00d7M where each entry cj,k indicates the flipping probability for generating a perturbed outcome: cj,k = P(r\u0303t = Rk|rt = Rj). Again we\u2019d like to note that we focus on settings with finite reward levels for most of our paper, but we provide discussions in Section 3.1 on how to handle continuous rewards with discretizations.\nIn the paper, we do not assume knowing the noise rates (i.e., the reward confusion matrices), which is different from the assumption of knowing them as adopted in many supervised learning works (Natarajan et al., 2013). Instead we will estimate the confusion matrices (Section 3.3).\nFootnote 1: \u03b3 = 1 indicates an undiscounted MDP setting (Schwartz, 1993; Sobel, 1994; Kakade, 2003). Footnote 2: The MDP with perturbed reward can equivalently be defined as a tuple M\u0303 = \u3008S,A,R, R\u0303,P, \u03b3\u3009, with the perturbation function C implicitly defined as the difference between R and R\u0303. Footnote 3: The case of state-dependent perturbed reward is discussed in Appendix C.3.\n3 LEARNING WITH PERTURBED REWARDS\nIn this section, we first introduce an unbiased estimator for binary rewards in our reinforcement learning setting when the error rates are known. This idea is inspired by Natarajan et al. (2013), but we will extend the method to the multi-outcome, as well as the continuous reward settings. 
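As a rough illustration of the noise model in Section 2.2, the sketch below samples a perturbed reward from a confusion matrix whose entry in row j and column k is the probability of observing level Rk when the true level is Rj. The function name perturb_reward, the two reward levels, and the noise rates are illustrative assumptions and are not taken from the paper's implementation.

```python
import random


def perturb_reward(true_reward, levels, confusion, rng):
    # Hypothetical helper: row j of `confusion` holds
    # P(observed = levels[k] | true = levels[j]) for every k.
    j = levels.index(true_reward)
    return rng.choices(levels, weights=confusion[j], k=1)[0]


# Binary example with levels ordered as [r-, r+] (values assumed).
levels = [-1.0, 1.0]
e_minus, e_plus = 0.1, 0.3
confusion = [
    [1 - e_minus, e_minus],  # true r-: observed as r+ with probability e-
    [e_plus, 1 - e_plus],    # true r+: observed as r- with probability e+
]

rng = random.Random(0)
observed = [perturb_reward(1.0, levels, confusion, rng) for _ in range(10000)]
# Empirical rate at which a true r+ is reported as r-; should be near e+.
flip_rate = observed.count(-1.0) / len(observed)
```

With an identity confusion matrix the observed reward always equals the true reward, which gives a quick sanity check of the row and column convention.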
3.1 UNBIASED ESTIMATOR FOR TRUE REWARD\nWith the knowledge of noise rates (reward confusion matrices), we are able to establish an unbiased approximation of the true reward in a similar way as done in Natarajan et al. (2013). We will call such a constructed unbiased reward a surrogate reward. To give an intuition, we start with replicating the results for binary reward R = {r\u2212, r+} in our RL setting: Lemma 1. Let r be bounded. Then, if we define\nr\u0302(st, at, st+1) := [(1\u2212e\u2212)\u00b7r+ \u2212 e+\u00b7r\u2212] / (1\u2212e+\u2212e\u2212) if r\u0303(st, at, st+1) = r+, and r\u0302(st, at, st+1) := [(1\u2212e+)\u00b7r\u2212 \u2212 e\u2212\u00b7r+] / (1\u2212e+\u2212e\u2212) if r\u0303(st, at, st+1) = r\u2212, (1)\nwe have for any r(st, at, st+1), Er\u0303|r[r\u0302(st, at, st+1)] = r(st, at, st+1).\nIn the standard supervised learning setting, the above property guarantees convergence - as more training data are collected, the empirical surrogate risk converges to its expectation, which is the same as the expectation of the true risk (due to unbiased estimators). This is also the intuition why we would like to replace the reward terms with surrogate rewards in our RL algorithms.\nThe above idea can be generalized to the multi-outcome setting in a fairly straightforward way. Define R\u0302 := [r\u0302(r\u0303 = R0), r\u0302(r\u0303 = R1), ..., r\u0302(r\u0303 = RM\u22121)], where r\u0302(r\u0303 = Rm) denotes the value of the surrogate reward when the observed reward is Rm. Let R = [R0;R1; \u00b7 \u00b7 \u00b7 ;RM\u22121] be the bounded reward matrix with M values. We have the following results: Lemma 2. Suppose CM\u00d7M is invertible. Defining\nR\u0302 = C\u22121 \u00b7R, (2)\nwe have for any r(st, at, st+1), Er\u0303|r[r\u0302(st, at, st+1)] = r(st, at, st+1).\nContinuous reward When the reward signal is continuous, we discretize it into M intervals and view each interval as a reward level, with its value approximated by its middle point. 
With increasing M, this quantization error can be made arbitrarily small. Our method is then the same as the solution for the multi-outcome setting, except for replacing rewards with discretized ones. Note that the finer the quantization we take, the smaller the quantization error - but we would suffer from learning a bigger reward confusion matrix. This is a trade-off question that can be addressed empirically.\nSo far we have assumed knowing the confusion matrices, but we will address this additional estimation issue in Section 3.3, and present our complete algorithm therein.\n3.2 CONVERGENCE AND SAMPLE COMPLEXITY: Q-LEARNING\nWe now analyze the convergence and sample complexity of our surrogate reward based RL algorithms (assuming C is known), taking Q-Learning as an example. Convergence guarantee First, the convergence guarantee is stated in the following theorem: Theorem 1. Given a finite MDP, denoted as M\u0302 = \u3008S,A, R\u0302,P, \u03b3\u3009, the Q-learning algorithm with surrogate rewards, given by the update rule,\nQt+1(st, at) = (1\u2212 \u03b1t)Q(st, at) + \u03b1t [ r\u0302t + \u03b3 max_{b\u2208A} Q(st+1, b) ], (3)\nconverges w.p.1 to the optimal Q-function as long as \u2211_t \u03b1t = \u221e and \u2211_t \u03b1t^2 < \u221e.\nNote that the term on the right-hand side of Eqn. (3) includes the surrogate reward r\u0302 estimated using Eqn. (1) and Eqn. (2). Theorem 1 states that agents will converge to the optimal policy w.p.1 when the rewards are replaced with surrogate rewards, despite the noise in the observed rewards. This result is not surprising - though the surrogate rewards introduce larger variance, their unbiasedness grants us the convergence. 
In other words, the addition of the perturbed reward does not destroy the convergence guarantees of Q-Learning.\nSample complexity To establish our sample complexity results, we first introduce a generative model following previous literature (Kearns & Singh, 1998; 2000; Kearns et al., 1999). This is a practical MDP setting to simplify the analysis. Definition 1. A generative model G(M) for an MDP M is a sampling model which takes a state-action pair (st, at) as input, and outputs the corresponding reward r(st, at) and the next state st+1 randomly with the probability of Pa(st, st+1), i.e., st+1 \u223c P(\u00b7|s, a).\nExact value iteration is impractical if the agents follow the generative models above exactly (Kakade, 2003). Consequently, we introduce a phased Q-Learning which is similar to the ones presented in Kakade (2003); Kearns & Singh (1998) for the convenience of proving our sample complexity results. We briefly outline phased Q-Learning as follows - the complete description (Algorithm 2) can be found in Appendix A. Definition 2. Phased Q-Learning algorithm takes m samples per phase by calling generative model G(M). It uses the collected m samples to estimate the transition probability P and update the estimated value function per phase. Calling generative model G(M\u0302) means that surrogate rewards are returned and used to update value function per phase.\nThe sample complexity of Phased Q-Learning is given as follows: Theorem 2. (Upper Bound) Let r \u2208 [0, Rmax] be bounded reward, C be an invertible reward confusion matrix with det(C) denoting its determinant. 
For an appropriate choice of m, the Phased Q-Learning algorithm calls the generative model G(M\u0302) O( |S||A|T / (\u03b5^2 (1\u2212\u03b3)^2 det(C)^2) \u00b7 log(|S||A|T / \u03b4) ) times in T epochs, and returns a policy such that for all states s \u2208 S, |V\u03c0(s) \u2212 V \u2217(s)| \u2264 \u03b5, \u03b5 > 0, w.p. \u2265 1 \u2212 \u03b4, 0 < \u03b4 < 1.\nTheorem 2 states that, to guarantee the convergence to the optimal policy, the number of samples needed is no more than O(1/det(C)^2) times the number needed when the RL agent observes true rewards perfectly. This additional constant is the price we pay for the noise present in our learning environment. When the noise level is high, we expect to see a much higher 1/det(C)^2; otherwise, when we are in a low-noise regime, Q-Learning can be very efficient with surrogate reward (Kearns & Singh, 2000). Note that Theorem 2 gives the upper bound in the discounted MDP setting; for the undiscounted setting (\u03b3 = 1), the upper bound is of the order of O( |S||A|T^3 / (\u03b5^2 det(C)^2) \u00b7 log(|S||A|T / \u03b4) ). The lower bound result is omitted due to the lack of space. The idea of constructing an MDP in which learning is difficult and the algorithm must make \u03a9( (|S||A|T / \u03b5) \u00b7 log(1/\u03b4) ) calls to G(M\u0302) is similar to Kakade (2003).\nWhile the surrogate reward guarantees the unbiasedness, we sacrifice the variance at each of our learning steps, and this in turn delays the convergence (as also evidenced in the sample complexity bound). It can be verified that the variance of surrogate reward is bounded when C is invertible, and it is always higher than the variance of true reward. This is summarized in the following theorem: Theorem 3. Let r \u2208 [0, Rmax] be bounded reward and confusion matrix C is invertible. 
Then, the variance of surrogate reward r\u0302 is bounded as follows: Var(r) \u2264 Var(r\u0302) \u2264 (M^2 / det(C)^2) \u00b7 Rmax^2.\nTo give an intuition of the bound, when we have binary reward, the variance for surrogate reward bounds as follows: Var(r) \u2264 Var(r\u0302) \u2264 4 Rmax^2 / (1\u2212e+\u2212e\u2212)^2. As e\u2212 + e+ \u2192 1, the variance becomes unbounded and the proposed estimator is no longer effective, nor will it be well-defined. In practice, there is a trade-off question between bias and variance by tuning a linear combination of R and R\u0302, i.e., Rproxy = \u03b7R + (1\u2212 \u03b7)R\u0302, and choosing an appropriate \u03b7 \u2208 [0, 1]. 3.3 ESTIMATION OF CONFUSION MATRICES\nIn Section 3.1 we have assumed the knowledge of reward confusion matrices, in order to compute the surrogate reward. This knowledge is often not available in practice. Estimating these confusion matrices is challenging without knowing any ground truth reward information; but we\u2019d like to note that efficient algorithms have been developed to estimate the confusion matrices in supervised learning settings (Bekker & Goldberger, 2016; Liu & Liu, 2017; Khetan et al., 2017; Hendrycks et al., 2018). The idea in these algorithms is to dynamically refine the error rates based on aggregated rewards. Note this approach is not different from the inference methods in aggregating crowdsourcing labels, as referred in the literature (Dawid & Skene, 1979; Karger et al., 2011; Liu et al., 2012). We adapt this idea to our reinforcement learning setting, which is detailed as follows.\nAt each training step, the RL agent collects the noisy reward and the current state-action pair. Then, for each pair in S \u00d7 A, the agent predicts the true reward based on accumulated historical observations of reward for the corresponding state-action pair via, e.g., averaging (majority voting). 
Finally, with the predicted true reward and the accuracy (error rate) for each state-action pair, the estimated reward confusion matrices C\u0303 are given by\nc\u0303i,j = ( \u2211_{(s,a)\u2208S\u00d7A} #[r\u0303(s, a) = Rj | r\u0304(s, a) = Ri] ) / ( \u2211_{(s,a)\u2208S\u00d7A} #[r\u0304(s, a) = Ri] ), (4)\nwhere # [\u00b7] denotes the number of state-action pairs that satisfy the condition [\u00b7] in the set of observed rewards R\u0303(s, a) (see Algorithm 1 and 3); r\u0304(s, a) and r\u0303(s, a) denote predicted true rewards (using majority voting) and observed rewards when the state-action pair is (s, a). The above procedure of updating c\u0303i,j continues indefinitely as more observations arrive.\nAlgorithm 1 Reward Robust RL (sketch)\nInput: M\u0303, \u03b1, \u03b2, R\u0303(s, a)\nOutput: Q(s), \u03c0(s, t)\nInitialize value function Q(s, a) arbitrarily.\nwhile Q is not converged do\nInitialize state s \u2208 S\nwhile s is not terminal do\nChoose a from s using policy derived from Q\nTake action a, observe s\u2032 and noisy reward r\u0303\nif collecting enough r\u0303 for every S \u00d7A pair then\nGet predicted true reward r\u0304 using majority voting\nEstimate confusion matrix C\u0303 based on r\u0303 and r\u0304 (Eqn. 4)\nObtain surrogate reward r\u0307 (R\u0302 = (1\u2212 \u03b7) \u00b7R + \u03b7 \u00b7C\u22121R)\nUpdate Q using surrogate reward\ns\u2190 s\u2032\nreturn Q(s) and \u03c0(s)\nOur final definition of surrogate reward replaces a known reward confusion C in Eqn. (2) with our estimated one C\u0303. We denote this estimated surrogate reward as r\u0307.\nWe present (Reward Robust RL) in Algorithm 1 (see footnote 4). Note that the algorithm is rather generic, and we can plug any existing RL algorithm into our reward robust one, with the only change being that the rewards are replaced with our estimated surrogate rewards. 4 EXPERIMENTS\nIn this section, reward robust RL is tested in different games, with different noise settings. 
Due to space limit, more experimental results can be found in Appendix D. 4.1 EXPERIMENTAL SETUP\nEnvironments and RL Algorithms To fully test the performance under different environments, we evaluate the proposed robust reward RL method on two classic control games (CartPole, Pendulum) and seven Atari 2600 games (AirRaid, Alien, Carnival, MsPacman, Pong, Phoenix, Seaquest), which encompass a large variety of environments, as well as rewards. Specifically, the rewards could be unary (CartPole), binary (most of Atari games), multivariate (Pong) and even continuous (Pendulum). A set of state-of-the-art reinforcement learning algorithms are experimented with while training under different amounts of noise (See Table 3)5. For each game and algorithm, three policies are trained based on different random initialization to decrease the variance.\nReward Post-Processing For each game and RL algorithm, we test the performances for learning with true rewards, learning with noisy rewards and learning with surrogate rewards. Both symmetric and asymmetric noise settings with different noise levels are tested. For symmetric noise, the confusion matrices are symmetric. As for asymmetric noise, two types of random noise are tested: 1) rand-one, each reward level can only be perturbed into another reward; 2) rand-all, each reward could be perturbed to any other reward, via adding a random noise matrix. To measure the amount of noise w.r.t confusion matrices, we define the weight of noise \u03c9 in Appendix B.2. The larger \u03c9 is, the higher the noise rates are.\n4One complete Q-Learning implementation (Algorithm 3) is provided in Appendix C.1. 5The detailed settings are accessible in Appendix B. 4.2 ROBUSTNESS EVALUATION\nCartPole The goal in CartPole is to prevent the pole from falling by controlling the cart\u2019s direction and velocity. The reward is +1 for every step taken, including the termination step. 
When the cart or pole deviates too much or the episode length is longer than 200, the episode terminates. Due to the unary reward {+1} in CartPole, a corrupted reward \u22121 is added as the unexpected error (e\u2212 = 0). As a result, the reward space R is extended to {+1,\u22121}. Five algorithms Q-Learning (1992), CEM (2006), SARSA (1998), DQN (2016) and DDQN (2016) are evaluated.\nrewards (r\u0303) and estimated surrogate rewards (r\u0307) (\u03b7 = 1) . Note that reward confusion matrices C are unknown to the agents here. Full results are in Appendix D.2 (Figure 6).\nIn Figure 1, we show that our estimator successfully produces meaningful surrogate rewards that adapt the underlying RL algorithms to the noisy settings, without any assumption of the true distribution of rewards. With the noise rate increasing (from 0.1 to 0.9), the models with noisy rewards converge slower due to larger biases. However, we observe that the models always converge to the best score 200 with the help of surrogate rewards.\nIn some circumstances (slight noise - see Figure 6a, 6b, 6c, 6d), the surrogate rewards even lead to faster convergence. This points out an interesting observation: learning with surrogate reward even outperforms the case with observing the true reward. We conjecture that the way of adding noise and then removing the bias introduces implicit exploration. This implies that for settings even with true reward, we might consider manually adding noise and then remove it in expectation.\nPendulum The goal in Pendulum is to keep a frictionless pendulum standing up. Different from the CartPole setting, the rewards in pendulum are continuous: r \u2208 (\u221216.28, 0.0]. The closer the reward is to zero, the better performance the model achieves. 
Following our extension (see Section 3.1), the (\u221217, 0] is firstly discretized into 17 intervals: (\u221217,\u221216], (\u221216,\u221215], \u00b7 \u00b7 \u00b7 , (\u22121, 0], with its value approximated using its maximum point. After the quantization step, the surrogate rewards can be estimated using multi-outcome extensions presented in Section 3.1.\nWe experiment two popular algorithms, DDPG (2015) and NAF (2016) in this game. In Figure 2, both algorithms perform well with surrogate rewards under different amounts of noise. In most cases, the biases were corrected in the long-term, even when the amount of noise is extensive (e.g., \u03c9 = 0.7). The quantitative scores on CartPole and Pendulum are given in Table 1, where the\nrewards (r\u0303) and surrogate rewards (r\u0302) (\u03b7 = 1) . Both symmetric and asymmetric noise are conduced in the experiments. Full results are in Appendix D.2 (Figure 8).\nscores are averaged based on the last thirty episodes. The full results (\u03c9 > 0.5) can be found in Appendix D.1, so does Table 2. Our reward robust RL method is able to achieve consistently good scores. Atari We validate our algorithm on seven Atari 2600 games using the state-of-the-art algorithm PPO (Schulman et al., 2017). The games are chosen to cover a variety of environments. The rewards in the Atari games are clipped into {\u22121, 0, 1}. We leave the detailed settings to Appendix B.\nand surrogate rewards (\u03b7 = 1) (r\u0302) . The noise rates increase from 0.1 to 0.9, with a step of 0.1.\nResults for PPO on Pong-v4 in symmetric noise setting are presented in Figure 3. Due to limited space, more results on other Atari games and noise settings are given in Appendix D.3. Similar to previous results, our surrogate estimator performs consistently well and helps PPO converge to the optimal policy. Table 2 shows the average scores of PPO on five selected Atari games with different amounts of noise (symmetric & asymmetric). 
In particular, when the noise rates e+ = e\u2212 > 0.3, agents with surrogate rewards obtain significant amounts of improvements in average scores. We do not present the results for the case with unknown C because the state-space (image-input) is very large for Atari games, which is difficult to handle with the solution given in Section 3.3. 5 CONCLUSION\nOnly an underwhelming amount of reinforcement learning studies have focused on the settings with perturbed and noisy rewards, despite the fact that such noises are common when exploring a realworld scenario, that faces sensor errors or adversarial examples. We adapt the ideas from supervised\nlearning with noisy examples (Natarajan et al., 2013), and propose a simple but effective RL framework for dealing with noisy rewards. The convergence guarantee and finite sample complexity of Q-Learning (or its variant) with estimated surrogate rewards are given. To validate the effectiveness of our approach, extensive experiments are conducted on OpenAI Gym, showing that surrogate rewards successfully rescue models from misleading rewards even at high noise rates. A PROOFS\nProof of Lemma 1. For simplicity, we shorthand r\u0302(st, at, st+1), r\u0303(st, at, st+1), r(st, at, st+1) as r\u0302, r\u0303, r, and let r+, r\u2212, r\u0302+, r\u0302\u2212 denote the general reward levels and corresponding surrogate ones:\nEr\u0303|r(r\u0302) = Pr\u0303|r(r\u0302 = r\u0302\u2212)r\u0302\u2212 + Pr\u0303|r(r\u0302 = r\u0302+)r\u0302+. (5) When r = r+, from the definition in Lemma 1:\nPr\u0303|r(r\u0302 = r\u0302\u2212) = e+, Pr\u0303|r(r\u0302 = r\u0302+) = 1\u2212 e+. Taking the definition of surrogate rewards Eqn. (1) into Eqn. 
(5), we have\nEr\u0303|r(r\u0302) = e+ \u00b7 r\u0302\u2212 + (1\u2212 e+) \u00b7 r\u0302+\n= e+ \u00b7 (1\u2212 e+)r\u2212 \u2212 e\u2212r+\n1\u2212 e\u2212 \u2212 e+ + (1\u2212 e+) \u00b7 (1\u2212 e\u2212)r+ \u2212 e+r\u2212 1\u2212 e\u2212 \u2212 e+ = r+.\nSimilarly, when r = r\u2212, it also verifies Er\u0303|r[r\u0302(st, at, st+1)] = r(st, at, st+1).\nProof of Lemma 2. The idea of constructing unbiased estimator is easily adapted to multi-outcome reward settings via writing out the conditions for the unbiasedness property (s.t. Er\u0303|r[r\u0302] = r.). For simplicity, we shorthand r\u0302(r\u0303 = Ri) as R\u0302i in the following proofs. Similar to Lemma 1, we need to solve the following set of functions to obtain r\u0302: R0 = c0,0 \u00b7 R\u03020 + c0,1 \u00b7 R\u03021 + \u00b7 \u00b7 \u00b7+ c0,M\u22121 \u00b7 R\u0302M\u22121 R1 = c1,0 \u00b7 R\u03020 + c1,1 \u00b7 R\u03021 + \u00b7 \u00b7 \u00b7+ c1,M\u22121 \u00b7 R\u0302M\u22121 \u00b7 \u00b7 \u00b7\nRM\u22121 = cM\u22121,0 \u00b7 R\u03020 + cM\u22121,1 \u00b7 R\u03021 + \u00b7 \u00b7 \u00b7+ cM\u22121,M\u22121 \u00b7 R\u0302M\u22121\nwhere R\u0302i denotes the value of the surrogate reward when the observed reward is Ri. Define R := [R0;R1; \u00b7 \u00b7 \u00b7 ;RM\u22121], and R\u0302 := [R\u03020, R\u03021, ..., R\u0302M\u22121], then the above equations are equivalent to: R = C \u00b7 R\u0302. If the confusion matrix C is invertible, we obtain the surrogate reward:\nR\u0302 = C\u22121 \u00b7R. 
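The relation R = C \u00b7 R\u0302 can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-level reward vector and a hypothetical confusion matrix; the helper solve_linear is our own illustrative Gauss-Jordan routine, not part of the paper's code. It computes the surrogate levels by solving the linear system and then recomputes, for each true level, the expected surrogate reward under the corresponding row of C.

```python
def solve_linear(a, b):
    # Illustrative Gauss-Jordan elimination with partial pivoting: solves a x = b.
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError('confusion matrix is (numerically) singular')
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Assumed example values: three reward levels and a row-stochastic confusion matrix,
# where confusion[j][k] = P(observed level k | true level j).
levels = [-1.0, 0.0, 1.0]
confusion = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

# Surrogate levels: solving C x = R is equivalent to x = C^{-1} R.
surrogate = solve_linear(confusion, levels)

# Expected surrogate reward conditioned on each true level (one value per row of C).
expected = [sum(c * r_hat for c, r_hat in zip(row, surrogate)) for row in confusion]
```

Each entry of expected should match the corresponding true level up to floating-point error, which is the unbiasedness property stated in Lemma 2. For a two-level matrix the same routine should reproduce the closed form of Eqn. (1).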
According to above definition, for any true reward level Ri, i = 0, 1, \u00b7 \u00b7 \u00b7 ,M \u2212 1, we have\nEr\u0303|r=Ri [r\u0302] = ci,0 \u00b7 R\u03020 + ci,1 \u00b7 R\u03021 + \u00b7 \u00b7 \u00b7+ ci,M\u22121 \u00b7 R\u0302M\u22121 = Ri.\nFurthermore, the probabilities for observing surrogate rewards can be written as follows:\nP\u0302 = [p\u03021, p\u03022, \u00b7 \u00b7 \u00b7 , p\u0302M ] = \u2211 j pjcj,1, \u2211 j pjcj,2, \u00b7 \u00b7 \u00b7 , \u2211 j pjcj,M  , where p\u0302i = \u2211 j pjcj,i, and p\u0302i, pi represent the probabilities of occurrence for surrogate reward R\u0302i and true reward Ri respectively. Corollary 1. Let p\u0302i and pi denote the probabilities of occurrence for surrogate reward r\u0302(r\u0303 = Ri) and true reward Ri. Then the surrogate reward satisfies,\u2211\ns\u2032\u2208S Pa(st, st+1)r(st, a, st+1) = \u2211 j pjRj = \u2211 j p\u0302jR\u0302j . (6)\nProof of Corollary 1. From Lemma 2, we have,\u2211 st\u2208S Pa(st, st+1)r(st, a, st+1) = \u2211 st+1\u2208S;Rj\u2208R Pa(st, st+1, Rj)Rj\n= \u2211 Rj\u2208R \u2211 st+1\u2208S Pa(st, st+1)Rj = \u2211 Rj\u2208R pjRj = \u2211 j pjRj .\nConsequently, \u2211 j p\u0302jR\u0302j = \u2211 j \u2211 k pkck,jR\u0302j = \u2211 k pk \u2211 j ck,jR\u0302j\n= \u2211 k pkRk = \u2211 st\u2208S Pa(st, st+1)r(st, a, st+1).\nTo establish Theorem 1, we need an auxiliary result (Lemma 3) from stochastic process approximation, which is widely adopted for the convergence proof for Q-Learning (Jaakkola et al., 1993; Tsitsiklis, 1994). Lemma 3. 
The random process {\u2206t} taking values in Rn and defined as\n\u2206t+1(x) = (1 \u2212 \u03b1t(x))\u2206t(x) + \u03b1t(x)Ft(x)\nconverges to zero w.p.1 under the following assumptions:\n\u2022 0 \u2264 \u03b1t \u2264 1, \u2211_t \u03b1t(x) = \u221e and \u2211_t \u03b1t(x)^2 < \u221e;\n\u2022 ||E[Ft(x)|Ft]||_W \u2264 \u03b3||\u2206t||, with \u03b3 < 1;\n\u2022 var[Ft(x)|Ft] \u2264 C(1 + ||\u2206t||_W^2), for C > 0.\nHere Ft = {\u2206t, \u2206t\u22121, \u00b7 \u00b7 \u00b7 , Ft\u22121, \u00b7 \u00b7 \u00b7 , \u03b1t, \u00b7 \u00b7 \u00b7 } stands for the past at step t; \u03b1t(x) is allowed to depend on the past insofar as the above conditions remain valid. The notation || \u00b7 ||_W refers to some weighted maximum norm.\nProof of Lemma 3. See previous literature (Jaakkola et al., 1993; Tsitsiklis, 1994).\nProof of Theorem 1. For simplicity, we abbreviate st, st+1, Qt, Qt+1, rt, r\u0302t and \u03b1t as s, s\u2032, Q, Q\u2032, r, r\u0302, and \u03b1, respectively.\nSubtracting the quantity Q\u2217(s, a) from both sides of Eqn. (3):\nQ\u2032(s, a) \u2212 Q\u2217(s, a) = (1 \u2212 \u03b1)(Q(s, a) \u2212 Q\u2217(s, a)) + \u03b1[r\u0302 + \u03b3 max_{b\u2208A} Q(s\u2032, b) \u2212 Q\u2217(s, a)].\nLet \u2206t(s, a) = Q(s, a) \u2212 Q\u2217(s, a) and Ft(s, a) = r\u0302 + \u03b3 max_{b\u2208A} Q(s\u2032, b) \u2212 Q\u2217(s, a). Then\n\u2206t+1(s, a) = (1 \u2212 \u03b1)\u2206t(s, a) + \u03b1Ft(s, a).
In consequence,\nE [Ft(x)|Ft] = \u2211\ns\u2032\u2208S;r\u0302\u2208R\nPa(s, s\u2032, r\u0302) [ r\u0302 + \u03b3max\nb\u2208A Q(s\u2032, b)\n] \u2212Q\u2217(s, a)\n= \u2211\ns\u2032\u2208S;r\u0302\u2208R Pa(s, s\u2032, r\u0302)r\u0302 + \u2211 s\u2032\u2208S Pa(s, s\u2032) [ \u03b3max b\u2208A Q(s\u2032, b)\u2212 r \u2212 \u03b3max b\u2208A Q\u2217(s\u2032, b) ]\n= \u2211\ns\u2032\u2208S;r\u0302\u2208R Pa(s, s\u2032, r\u0302)r\u0302 \u2212 \u2211 s\u2032\u2208S Pa(s, s\u2032)r + \u2211 s\u2032\u2208S Pa(s, s\u2032)\u03b3 [ max b\u2208A Q(s\u2032, b)\u2212max b\u2208A Q\u2217(s\u2032, b) ]\n= \u2211 j p\u0302j r\u0302j \u2212 \u2211 s\u2032\u2208S Pa(s, s\u2032)r + \u2211 s\u2032\u2208S Pa(s, s\u2032)\u03b3 [ max b\u2208A Q(s\u2032, b)\u2212max b\u2208A Q\u2217(s\u2032, b) ]\n= \u2211 s\u2032\u2208S Pa(s, s\u2032)\u03b3 [ max b\u2208A Q(s\u2032, b)\u2212max b\u2208A Q\u2217(s\u2032, b) ] (using Eqn. (6))\n\u2264 \u03b3 \u2211 s\u2032\u2208S Pa(s, s\u2032) max b\u2208A,s\u2032\u2208S |Q(s\u2032, b)\u2212Q\u2217(s\u2032, b)|\n= \u03b3 \u2211 s\u2032\u2208S Pa(s, s\u2032)||Q\u2212Q\u2217||\u221e = \u03b3||Q\u2212Q\u2217||\u221e = \u03b3||\u2206t||\u221e.\nFinally,\nVar [Ft(x)|Ft] = E  r\u0302 + \u03b3max\nb\u2208A Q(s\u2032, b)\u2212 \u2211 s\u2032\u2208S;r\u0302\u2208R P\u2032(s, s\u2032, r\u0302) [ r\u0302 + \u03b3max b\u2208A Q(s\u2032, b)\n]2 \n= Var [ r\u0302 + \u03b3max\nb\u2208A Q(s\u2032, b)|Ft ] .\nBecause r\u0302 is bounded, it can be clearly verified that\nVar [Ft(x)|Ft] \u2264 C(1 + ||\u2206t||2W )\nfor some constantC. Then, due to the Lemma 3, \u2206t converges to zero w.p.1, i.e.,Q\u2032(s, a) converges to Q\u2217(s, a).\nThe procedure of Phased Q-Learning is described as Algorithm 2:\nAlgorithm 2 Phased Q-Learning Input: G(M): generative model ofM = (S,A,R,P, \u03b3), T : number of iterations. Output: V\u0302 (s): value function, \u03c0\u0302(s, t): policy function.\n1: Set V\u0302T (s) = 0 2: for t = T \u2212 1, \u00b7 \u00b7 \u00b7 , 0 do\n1. 
Call G(M) m times for each state-action pair and set\nP\u0302a(st, st+1) = #[(st, at) \u2192 st+1] / m\n2. Set\nV\u0302(s) = max_{a\u2208A} \u2211_{st+1\u2208S} P\u0302a(st, st+1)[rt + \u03b3V\u0302(st+1)]\n\u03c0\u0302(s, t) = arg max_{a\u2208A} V\u0302(s)\n3: return V\u0302(s) and \u03c0\u0302(s, t)\nNote that P\u0302 here is the estimated transition probability, which is different from P in Eqn. (6).\nTo obtain the sample complexity results, the range of our surrogate reward needs to be known. Assuming the reward r is bounded in [0, Rmax], Lemma 4 below states that the surrogate reward is also bounded when the confusion matrices are invertible:\nLemma 4. Let r \u2208 [0, Rmax] be bounded, where Rmax is a constant; suppose the M \u00d7 M confusion matrix C is invertible, with determinant det(C). Then the surrogate reward satisfies\n0 \u2264 |r\u0302| \u2264 M / det(C) \u00b7 Rmax. (7)\nProof of Lemma 4. From Eqn. (2), we have\nR\u0302 = C\u22121 \u00b7 R = adj(C) / det(C) \u00b7 R,\nwhere adj(C) is the adjugate matrix of C and det(C) is the determinant of C. It is known from linear algebra that\nadj(C)ij = (\u22121)^{i+j} \u00b7 Mji,\nwhere Mji is the determinant of the (M \u2212 1) \u00d7 (M \u2212 1) matrix that results from deleting row j and column i of C. Therefore, Mji is also bounded:\nMji \u2264 \u2211_{\u03c3\u2208Sn} (|sgn(\u03c3)| \u220f_{m=1} c\u2032m,\u03c3n) \u2264 \u220f_{m=0}^{M\u22121} (\u2211_{n=0}^{M\u22121} cm,n) = 1^M = 1,\nwhere the sum is computed over all permutations \u03c3 of the set {0, 1, \u00b7 \u00b7 \u00b7 , M \u2212 2}; c\u2032 is the element of Mji; sgn(\u03c3) returns +1 whenever the reordering given by \u03c3 can be achieved by successively interchanging two entries an even number of times, and \u22121 whenever it cannot. Consequently,\n|R\u0302i| \u2264 \u2211_j |adj(C)ij| \u00b7 |Rj| / det(C) \u2264 M / det(C) \u00b7 Rmax.\nProof of Theorem 2.
From Hoeffding\u2019s inequality, we obtain:\nP \u2223\u2223\u2223\u2223\u2223\u2223 \u2211 st+1\u2208S Pa(st, st+1)V \u2217t+1(st+1)\u2212 \u2211 st+1\u2208S P\u0302a(st, st+1)V \u2217t+1(st+1) \u2223\u2223\u2223\u2223\u2223\u2223 \u2265 \n\u2264 2 exp ( \u22122m 2(1\u2212 \u03b3)2\nR2max\n) ,\nbecause Vt(st) is bounded within Rmax1\u2212\u03b3 . In the same way, r\u0302t is bounded by M det(C) \u00b7 Rmax from Lemma 4. We then have,\nP  \u2223\u2223\u2223\u2223\u2223\u2223\u2223\u2223 \u2211\nst+1\u2208S r\u0302t\u2208R\u0302\nPa(st, st+1, r\u0302t)r\u0302t \u2212 \u2211\nst+1\u2208S r\u0302t\u2208R\u0302\nP\u0302a(st, st+1, r\u0302t)r\u0302t \u2223\u2223\u2223\u2223\u2223\u2223\u2223\u2223 \u2265  \u2264 2 exp(\u22122m 2det(C)2M2R2max ) .\nFurther, due to the unbiasedness of surrogate rewards, we have\u2211 st+1\u2208S Pa(st, st+1)rt = \u2211\nst+1\u2208S;r\u0302t\u2208R\u0302\nPa(st, st+1, r\u0302t)r\u0302t.\nAs a result,\u2223\u2223\u2223V \u2217t (s)\u2212 V\u0302t(s)\u2223\u2223\u2223 = max a\u2208A \u2211 st+1\u2208S Pa(st, st+1) [ rt + \u03b3V \u2217 t+1(st+1) ] \u2212max\na\u2208A \u2211 st+1\u2208S P\u0302a(st, st+1) [ r\u0302t + \u03b3V \u2217 t+1(st+1) ]\n\u2264 1 + \u03b3max a\u2208A \u2223\u2223\u2223\u2223\u2223\u2223 \u2211 st+1\u2208S Pa(st, st+1)V \u2217t+1(st+1)\u2212 \u2211 st+1\u2208S P\u0302a(st, st+1)V \u2217t+1(st+1) \u2223\u2223\u2223\u2223\u2223\u2223 + max\na\u2208A \u2223\u2223\u2223\u2223\u2223\u2223 \u2211 st+1\u2208S Pa(st, st+1)rt \u2212 \u2211 st+1\u2208S;r\u0302t\u2208R\u0302 Pa(st, st+1, r\u0302t)r\u0302t \u2223\u2223\u2223\u2223\u2223\u2223 \u2264 \u03b3max\ns\u2208S \u2223\u2223\u2223V \u2217t+1(s)\u2212 V\u0302t+1(s)\u2223\u2223\u2223+ 1 + \u03b3 2 In the same way, \u2223\u2223\u2223Vt(s)\u2212 V\u0302t(s)\u2223\u2223\u2223 \u2264 \u03b3max\ns\u2208S\n\u2223\u2223\u2223V \u2217t+1(s)\u2212 V\u0302t+1(s)\u2223\u2223\u2223+ 1 + \u03b3 2\nRecursing the two equations in two directions (0\u2192 T ), we get\nmax s\u2208S 
|V\u2217(s) \u2212 V\u0302(s)| \u2264 (\u03b51 + \u03b3\u03b52) + \u03b3(\u03b51 + \u03b3\u03b52) + \u00b7 \u00b7 \u00b7 + \u03b3^{T\u22121}(\u03b51 + \u03b3\u03b52) = (\u03b51 + \u03b3\u03b52)(1 \u2212 \u03b3^T) / (1 \u2212 \u03b3),\nmax_{s\u2208S} |V(s) \u2212 V\u0302(s)| \u2264 (\u03b51 + \u03b3\u03b52)(1 \u2212 \u03b3^T) / (1 \u2212 \u03b3).\nCombining these two inequalities, we have:\nmax_{s\u2208S} |V\u2217(s) \u2212 V(s)| \u2264 2(\u03b51 + \u03b3\u03b52)(1 \u2212 \u03b3^T) / (1 \u2212 \u03b3) \u2264 2(\u03b51 + \u03b3\u03b52) / (1 \u2212 \u03b3).\nLet \u03b51 = \u03b52, so max_{s\u2208S} |V\u2217(s) \u2212 V(s)| \u2264 \u03b5 as long as\n\u03b51 = \u03b52 \u2264 (1 \u2212 \u03b3)\u03b5 / (2(1 + \u03b3)).\nFor arbitrarily small \u03b5, by choosing m appropriately, there always exists \u03b51 = \u03b52 = (1 \u2212 \u03b3)\u03b5 / (2(1 + \u03b3)) such that the policy error is bounded within \u03b5. That is to say, the Phased Q-Learning algorithm can converge to a near-optimal policy within finite steps using our proposed surrogate rewards.\nFinally, there are |S||A|T transitions under which these conditions must hold, where | \u00b7 | represents the number of elements in a set. Using a union bound, the probability of failure in any condition is smaller than\n2|S||A|T \u00b7 exp( \u2212m \u00b7 \u03b5^2(1 \u2212 \u03b3)^2 / (2(1 + \u03b3)^2) \u00b7 min{(1 \u2212 \u03b3)^2, det(C)^2 / M^2} ).\nWe set the error rate to be less than \u03b4, so m should satisfy\nm = O( 1 / (\u03b5^2(1 \u2212 \u03b3)^2 det(C)^2) \u00b7 log(|S||A|T / \u03b4) ).\nIn consequence, after m|S||A|T calls, that is, O( |S||A|T / (\u03b5^2(1 \u2212 \u03b3)^2 det(C)^2) \u00b7 log(|S||A|T / \u03b4) ) calls, the value function converges to the optimal one for every state s, with probability greater than 1 \u2212 \u03b4.\nThe above bound is for the discounted MDP setting with 0 \u2264 \u03b3 < 1. For the undiscounted setting \u03b3 = 1, since the total error (for the entire trajectory of T time-steps) has to be bounded by \u03b5, the error for each time step has to be bounded by \u03b5 / T.
Repeating our analysis, we obtain the following upper bound:\nO( |S||A|T^3 / (\u03b5^2 det(C)^2) \u00b7 log(|S||A|T / \u03b4) ).\nProof of Theorem 3.\nVar(r\u0302) \u2212 Var(r) = E[(r\u0302 \u2212 E[r\u0302])^2] \u2212 E[(r \u2212 E[r])^2] = E[r\u0302^2] \u2212 E[r\u0302]^2 \u2212 E[r^2] + E[r]^2\n= \u2211_j p\u0302j R\u0302j^2 \u2212 (\u2211_j p\u0302j R\u0302j)^2 \u2212 [\u2211_j pj Rj^2 \u2212 (\u2211_j pj Rj)^2]\n= \u2211_j p\u0302j R\u0302j^2 \u2212 \u2211_j pj Rj^2 (since \u2211_j p\u0302j R\u0302j = \u2211_j pj Rj by Eqn. (6))\n= \u2211_j \u2211_i pi ci,j R\u0302j^2 \u2212 \u2211_j pj (\u2211_i cj,i R\u0302i)^2\n= \u2211_j pj [\u2211_i cj,i R\u0302i^2 \u2212 (\u2211_i cj,i R\u0302i)^2].\nUsing the Cauchy-Schwarz inequality and \u2211_i cj,i = 1,\n\u2211_i cj,i R\u0302i^2 = (\u2211_i (\u221acj,i)^2) \u00b7 (\u2211_i (\u221acj,i R\u0302i)^2) \u2265 (\u2211_i cj,i R\u0302i)^2.\nSo we get Var(r\u0302) \u2212 Var(r) \u2265 0.\nIn addition,\nVar(r\u0302) = \u2211_j p\u0302j R\u0302j^2 \u2212 (\u2211_j p\u0302j R\u0302j)^2 \u2264 \u2211_j p\u0302j R\u0302j^2 \u2264 \u2211_j p\u0302j \u00b7 M^2 / det(C)^2 \u00b7 Rmax^2 = M^2 / det(C)^2 \u00b7 Rmax^2.\nB EXPERIMENTAL SETUP\nWe set up our experiments within the popular OpenAI baselines (Dhariwal et al., 2017) and keras-rl (Plappert, 2016) frameworks. Specifically, we integrate the algorithms and interact with OpenAI Gym (Brockman et al., 2016) environments (Table 3).\nB.1 RL ALGORITHMS\nWe experiment with a set of state-of-the-art reinforcement learning algorithms trained under different amounts of noise, including Q-Learning (Watkins, 1989; Watkins & Dayan, 1992), Cross-Entropy Method (CEM) (Szita & Lo\u0308rincz, 2006), Deep SARSA (Sutton & Barto, 1998), Deep Q-Network (DQN) (Mnih et al., 2013; 2015; van Hasselt et al., 2016), Dueling DQN (DDQN) (Wang et al., 2016), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Continuous DQN (NAF) (Gu et al., 2016) and Proximal Policy Optimization (PPO) (Schulman et al., 2017).
For each game and algorithm, three policies are trained from different random initializations to decrease the variance in experiments.\nB.2 POST-PROCESSING REWARDS\nWe explore both symmetric and asymmetric noise at different noise levels. For symmetric noise, the confusion matrices are symmetric, which means the probabilities of corruption for each reward choice are equivalent. For instance, a confusion matrix\nC = [0.8, 0.2; 0.2, 0.8]\nsays that r1 could be corrupted into r2 with a probability of 0.2, and vice versa (weight = 0.2).\nAs for asymmetric noise, two types of random noise are tested: 1) rand-one, each reward level can only be perturbed into one other reward; 2) rand-all, each reward could be perturbed to any other reward. To measure the amount of noise w.r.t. confusion matrices, we define the weight of noise as follows:\nC = (1 \u2212 \u03c9) \u00b7 I + \u03c9 \u00b7 N, \u03c9 \u2208 [0, 1],\nwhere \u03c9 controls the weight of noise; I and N denote the identity and noise matrix respectively. Suppose there are M outcomes for true rewards; N is written as:\nN = [n0,0, n0,1, \u00b7 \u00b7 \u00b7 , n0,M\u22121; \u00b7 \u00b7 \u00b7 ; nM\u22121,0, nM\u22121,1, \u00b7 \u00b7 \u00b7 , nM\u22121,M\u22121],\nwhere for each row i, 1) rand-one: randomly choose j, s.t. ni,j = 1 and ni,k = 0 if k \u2260 j; 2) rand-all: generate M random numbers that sum to 1, i.e., \u2211_j ni,j = 1. For simplicity, for symmetric noise, we choose N as an anti-identity matrix. As a result, ci,j = 0 if i \u2260 j and i + j \u2260 M \u2212 1.\nB.3 PERTURBED-REWARD MDP EXAMPLE\nTo obtain an intuitive view of the reward perturbation model, where the observed rewards are generated based on a reward confusion matrix, we constructed a simple MDP and evaluated the performance of robust reward Q-Learning (Algorithm 1) under different noise ratios (both symmetric and asymmetric).
The finite MDP is formulated as in Figure 4a: when the agent reaches state 5, it gets an instant reward of r+ = 1; otherwise it gets a zero reward r\u2212 = 0. During exploration, the rewards are perturbed according to the confusion matrix C2\u00d72 = [1\u2212 e\u2212, e\u2212; e+, 1\u2212 e+].\nTwo experiments are conducted in this setting: 1) performance of Q-Learning under different noise rates (Table 4); 2) robustness of the estimation module under time-variant noise (Figure 4b). As shown in Table 4, Q-Learning consistently achieved better results with the guidance of surrogate rewards and the confusion matrix estimation algorithm. For time-variant noise, we generated varying amounts of noise at different training stages: 1) e\u2212 = 0.1, e+ = 0.3 (0 to 1e4 steps); 2) e\u2212 = 0.2, e+ = 0.1 (1e4 to 3e4 steps); 3) e\u2212 = 0.3, e+ = 0.2 (3e4 to 5e4 steps); 4) e\u2212 = 0.1, e+ = 0.2 (5e4 to 7e4 steps). Figure 4b shows that Algorithm 1 is robust against time-variant noise: it dynamically adjusts the estimated C\u0303 after the noise distribution changes. Note that we set a maximum memory size for collected noisy rewards so that the agents learn only from recent observations.\nB.4 TRAINING DETAILS\nCartPole and Pendulum The policies use the default network from the keras-rl framework, which is a five-layer fully connected network6. There are three hidden layers, each of which has 16 units and is followed by a rectified nonlinearity. The last output layer is activated by the linear function.\n6https://github.com/keras-rl/keras-rl/examples\nFor CartPole, we trained the models using the Adam optimizer with a learning rate of 1e\u22123 for 10,000 steps. The exploration strategy is the Boltzmann policy. For DQN and Dueling-DQN, the update rate of the target model and the memory size are 1e\u22122 and 50,000. For Pendulum, we trained DDPG and NAF using the Adam optimizer with a learning rate of 5e\u22124 for 150,000 steps.
The update rate of the target model and the memory size are 1e\u22123 and 100,000.\nAtari Games We adopt the pre-processing steps as well as the network architecture from Mnih et al. (2015). Specifically, the input to the network is 84\u00d784\u00d74, a concatenation of the last 4 frames, each converted into 84 \u00d7 84 gray-scale. The network comprises three convolutional layers and two fully connected layers7. The kernel sizes of the three convolutional layers are 8 \u00d7 8 with stride 4 (32 filters), 4 \u00d7 4 with stride 2 (64 filters) and 3 \u00d7 3 with stride 1 (64 filters), respectively. Each hidden layer is followed by a rectified nonlinearity. Except for Pong, where we train the policies for 3e7 steps, all the games are trained for 5e7 steps with a learning rate of 3e\u22124. Note that the rewards in the Atari games are discrete and clipped into {\u22121, 0, 1}. Except for the Pong game, in which r = \u22121 means missing the ball hit by the adversary, the agents in other games attempt to get higher scores in the episode with binary rewards 0 and 1.\nC ESTIMATION OF CONFUSION MATRICES\nC.1 REWARD ROBUST RL ALGORITHMS\nAs stated in Section 3.3, the confusion matrix can be estimated dynamically based on the aggregated answers, similar to previous literature in supervised learning (Khetan et al., 2017). To get a concrete view, we take Q-Learning as an example; the resulting algorithm is called Reward Robust Q-Learning (Algorithm 3).
Note that it can be extended to other RL algorithms by plugging in the confusion matrix estimation steps and the computed surrogate rewards, as shown in the experiments (Figure 6).\nAlgorithm 3 Reward Robust Q-Learning\nInput:\nM\u0303 = (S,A, R\u0303,P, \u03b3): MDP with corrupted reward channel\nT : transition function T : S \u00d7A \u2192 S\nN \u2208 N: upper bound of collected noisy rewards\n\u03b1 \u2208 (0, 1): learning rate in the update rule\n\u03b7 \u2208 (0, 1): weight of unbiased surrogate reward\nR\u0303(s, a): set of observed rewards when the state-action pair is (s, a).\nOutput: Q(s): value function; \u03c0(s, t): policy function\nInitialize Q: S \u00d7A \u2192 R arbitrarily\nSet confusion matrix C\u0303 as zero\nwhile Q is not converged do\n  Start in state s \u2208 S\n  while s is not terminal do\n    Calculate \u03c0 according to Q and exploration strategy\n    a \u2190 \u03c0(s)\n    s\u2032 \u2190 T (s, a)\n    Observe noisy reward r\u0303(s, a) and add it to R\u0303(s, a)\n    if \u2211_{(s,a)} |R\u0303(s, a)| \u2265 N then\n      Get predicted true reward r\u0304(s, a) using majority voting in every R\u0303(s, a)\n      Estimate confusion matrix C\u0303 based on r\u0303(s, a) and r\u0304(s, a) (Eqn. (4))\n      Empty all the sets of observed rewards R\u0303(s, a)\n    Obtain surrogate reward r\u0307(s, a) using Rproxy = (1 \u2212 \u03b7) \u00b7 R + \u03b7 \u00b7 C\u22121R\n    Q(s, a) \u2190 (1 \u2212 \u03b1) \u00b7 Q(s, a) + \u03b1 \u00b7 (r\u0307(s, a) + \u03b3 \u00b7 max_{a\u2032} Q(s\u2032, a\u2032))\n    s \u2190 s\u2032\nreturn Q(s) and \u03c0(s)\n7https://github.com/openai/baselines/tree/master/baselines/common\nC.2 EXPECTATION-MAXIMIZATION IN ESTIMATION\nIn Algorithm 3, the predicted true reward r\u0304(s, a) is derived from majority voting in the collected noisy sets R\u0303(s, a) for every state-action pair (s, a) \u2208 S \u00d7 A, which is a simple but efficient way of leveraging the expectation of aggregated rewards without assumptions on the prior distribution of noise.
In the following, we adopt the standard Expectation-Maximization (EM) idea in our estimation framework (arguably a simple version of it), inspired by previous work (Zhang et al., 2014).\nAssuming the observed noisy rewards are independent conditional on the true reward, we can compute the posterior probability of the true reward from Bayes\u2019 theorem:\nP(r = Ri | r\u0303(1) = R1, \u00b7 \u00b7 \u00b7 , r\u0303(n) = Rn) = [P(r\u0303(1) = R1, \u00b7 \u00b7 \u00b7 , r\u0303(n) = Rn | r = Ri) \u00b7 P(r = Ri)] / [\u2211_j P(r\u0303(1) = R1, \u00b7 \u00b7 \u00b7 , r\u0303(n) = Rn | r = Rj) \u00b7 P(r = Rj)]\n= [P(r = Ri) \u00b7 \u220f_{k=1}^{n} P(r\u0303(k) = Rk | r = Ri)] / [\u2211_j P(r = Rj) \u00b7 \u220f_{k=1}^{n} P(r\u0303(k) = Rk | r = Rj)], (8)\nwhere P(r = Rj) is the prior of true rewards, and P(r\u0303 = Rk | r = Rj) is given by the current estimated confusion matrix C\u0303: P(r\u0303 = Rk | r = Rj) = c\u0303j,k. Note that the inference should be conducted for each state-action pair (s, a) \u2208 S \u00d7 A in every iteration, i.e., P(r(s, a) = Ri | r\u0303(s, a, 1) = R1, \u00b7 \u00b7 \u00b7 , r\u0303(s, a, n) = Rn), abbreviated as P(r\u0304(s, a) = Ri), which requires greater computation costs than the majority voting policy. It also suggests an interesting direction: exploring online EM algorithms for our perturbed-RL problem.\nAfter the inference steps in Eqn. (8), the confusion matrix C\u0303 is then updated based on the posterior probabilities:\nc\u0303i,j = [\u2211_{(s,a)} P(r\u0304(s, a) = Ri) \u00b7 #[r\u0303(s, a) = Rj | r\u0304(s, a) = Ri]] / [\u2211_{(s,a)} P(r\u0304(s, a) = Ri) \u00b7 #[r\u0304(s, a) = Ri]], (9)\nwhere P(r\u0304(s, a) = Ri) denotes the inferred probabilities of true rewards based on the collected noisy reward sets R\u0303(s, a). To utilize EM algorithms in the robust reward algorithms (e.g., Algorithm 3), we need to replace Eqn. (4) by Eqn. (9) for the estimation of the reward confusion matrix.
C.3 STATE-DEPENDENT PERTURBED REWARD\nIn previous sections, to keep our presentation focused, we considered state-independent perturbed-reward environments, which share the same confusion matrix for all states. In other words, the noise for different states is generated from the same distribution. More generally, the generation of r\u0303 follows a certain function C : S \u00d7 R \u2192 R\u0303, where different states may correspond to varied noise distributions (and thus varied confusion matrices). Our algorithm is still applicable, except that it must maintain different confusion matrices Cs for different states. It is worth noting that Theorem 1 holds because the surrogate rewards produce an unbiased estimation of true rewards for each state, i.e., Er\u0303|r,st [r\u0302(st, at, st+1)] = r(st, at, st+1). Furthermore, Theorems 2 and 3 can be revised as:\nTheorem 4. (Upper bound) Let r \u2208 [0, Rmax] be bounded reward, Cs be invertible reward confusion matrices with det(Cs) denoting its determinant. For an appropriate choice of m, the Phased Q-Learning algorithm calls the generative model G(M\u0302)\nO( |S||A|T / (\u03b5^2(1 \u2212 \u03b3)^2 min_{s\u2208S}{det(Cs)}^2) \u00b7 log(|S||A|T / \u03b4) )\ntimes in T epochs, and returns a policy such that for all states s \u2208 S, |V\u03c0(s) \u2212 V\u2217(s)| \u2264 \u03b5, \u03b5 > 0, w.p. \u2265 1 \u2212 \u03b4, 0 < \u03b4 < 1.\nTheorem 5. Let r \u2208 [0, Rmax] be bounded reward and all confusion matrices Cs be invertible. Then, the variance of surrogate reward r\u0302 is bounded as follows:\nVar(r) \u2264 Var(r\u0302) \u2264 M^2 / min_{s\u2208S}{det(Cs)}^2 \u00b7 Rmax^2.\nC.4 VARIANCE REDUCTION IN ESTIMATION\nAs illustrated in Theorem 3, our surrogate rewards introduce larger variance while conducting unbiased estimation, which is likely to decrease the stability of RL algorithms.
Apart from the linear combination idea (an appropriate trade-off), some variance reduction techniques in statistics (e.g., correlated sampling) can also be applied to our surrogate rewards. In particular, Romoff et al. (2018) proposed a reward estimator to compensate for stochastic corrupted reward signals. It is worth noting that their method is designed for variance reduction under stochastic (zero-mean) noise, and is no longer efficacious in the more general perturbed-reward setting. However, it is possible to integrate their method with our robust-reward RL framework because surrogate rewards guarantee unbiasedness in reward expectation.\nTo verify this idea, we repeated the CartPole experiments of Section 4.2 but included a variance reduction step for the estimated surrogate rewards. Following Romoff et al. (2018), we adopted the sample mean as a simple approximator during training and set the sequence length to 100. As shown in Figure 5, the models with only the variance reduction technique (red lines) suffer from huge biases when the noise is large, and cannot converge to the optimal policies, much like the models trained on noisy rewards. Nevertheless, surrogate rewards benefit from variance reduction (purple lines), achieving faster convergence or better performance in many cases (e.g., Figure 5a (\u03c9 = 0.7), 5b (\u03c9 = 0.3)). It is also not surprising that the integrated algorithm (purple lines) gains more as the noise rate increases (indicating larger variance from Theorem 3, e.g., \u03c9 = 0.9). Similarly, Table 5 provides quantitative results which show that our surrogate rewards benefit from variance reduction techniques (\u201cours + VRT\u201d), especially when the noise rate is large.\ngame with true rewards (r), noisy rewards (r\u0303) (\u03b7 = 1), sample-mean noisy rewards (\u03b7 = 1), estimated surrogate rewards (r\u0307) and sample-mean estimated surrogate rewards. Note that confusion matrices C are unknown to the agents here.
From top to bottom, the noise rates are 0.1, 0.3, 0.7 and 0.9. Here we repeated each experiment 10 times with different random seeds and plotted the 10% to 90% percentile area with its mean highlighted.\nC.5 EXPERIMENTAL RESULTS\nTo validate the effectiveness of robust reward algorithms (like Algorithm 3), where the noise rates are unknown to the agents, we conduct extensive experiments in CartPole. Note that the noise rates remain unknown to the RL agents throughout exploration. Besides, we discretize the observation (velocity, angle, etc.) to construct a set of states and implement the method following Algorithm 3. We set \u03b7 to 1.0 in the experiments.\nFigure 6 provides learning curves from five algorithms with different kinds of rewards. The proposed estimation algorithms successfully obtain the approximate confusion matrices, and are robust in the unknown noise environments. From Figure 7, we can observe that the estimation of confusion matrices converges very fast. The results are encouraging because we don\u2019t assume any additional knowledge about noise or true reward distribution in the implementation.\nCartPole game with true rewards (r), noisy rewards (r\u0303) (\u03b7 = 1) and estimated surrogate rewards (r\u0307). Note that confusion matrices C are unknown to the agents here. From top to bottom, the noise rates are 0.1, 0.3, 0.7 and 0.9. Here we repeated each experiment 10 times with different random seeds and plotted the 10% to 90% percentile area with its mean highlighted.\nD SUPPLEMENTARY EXPERIMENTAL RESULTS\nD.1 SUPPLEMENTARY QUANTITATIVE RESULTS\nD.2 VISUALIZATIONS ON CONTROL GAMES\n(r), noisy rewards (r\u0303) and surrogate rewards (r\u0302) (\u03b7 = 1). Both symmetric and asymmetric noise are conducted in the experiments. From top to bottom, the noise rates are 0.1, 0.3, 0.7 and 0.9, respectively.
Here we repeated each experiment 6 times with different random seeds and plotted 10% to 90% percentile area with its mean highlighted.\nD.3 VISUALIZATIONS ON ATARI GAMES8\nD.3.1 PONG\nAsymmetric Noise (rand-one)\n8For the clarity purpose, we remove the learning curves (blue ones in previous figures) with true rewards except for Pong-v4 game.\nAsymmetric Noise (rand-all) D.3.2 AIRRAID\nD.3.3 ALIEN D.3.4 CARNIVAL\n D.3.5 MSPACMAN\nD.3.6 PHOENIX\nD.3.7 SEAQUEST", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Think you have solved question answering? try arc, the ai2 reasoning challenge , Authors: Peter Clark Isaac Cowhey... , 2018\n2.  Title: Efficient and robust question answering from minimal context over documents , Authors: Sewon Min Victor Zhong... , 2018\n3.  Title: Retrieve-and-read: Multi-task learning of information retrieval and reading , Authors: Kyosuke Nishida Itsumi Saito... , 2018\n4.  Title: Yuanfudao at semeval-2018 task 11: Three-way attention and relational knowledge for commonsense machine comprehension , Authors: Liang Wang , 2018\n5.  Title: Reinforced ranker-reader for open-domain question answering , Authors: Shuohang Wang Mo Yu... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Open-domain question answering remains a challenging task as it requires models that are capable of understanding questions and answers, collecting useful information, and reasoning over evidence. Previous work typically formulates this task as a reading comprehension or entailment problem given evidence retrieved from search engines. However, existing techniques struggle to retrieve indirectly related evidence when no directly related evidence is provided, especially for complex questions where it is hard to parse precisely what the question asks. In this paper we propose a retriever-reader model that learns to attend on essential terms during the question answering process. We build (1) an essential term selector which first identifies the most important words in a question, then reformulates the query and searches for related evidence; and (2) an enhanced reader that distinguishes between essential terms and distracting words to predict the answer. 
We evaluate our model on multiple open-domain QA datasets where it outperforms the existing state-of-the-art, notably leading to a relative improvement of 8.1% on the AI2 Reasoning Challenge (ARC) dataset.", "There has recently been growing interest in building better retrievers for open-domain QA. Wang et al. (2018) proposed a Reinforced Ranker-Reader model that ranks retrieved evidence and assigns different weights to evidence prior to processing by the reader. Min et al. (2018) demonstrated that for several popular MRC datasets (e.g. SQuAD, TriviaQA) most questions can be answered using only a few sentences rather than the entire document. Motivated by this observation, they built a sentence selector to gather this potential evidence for use by the reader model. Nishida et al. (2018) developed a multi-task learning (MTL) method for a retriever and reader in order to obtain a strong retriever that considers certain passages including the answer text as positive samples during training. The proposed MTL framework is still limited to the scenario when it is feasible to discover whether the passages contain the answer span. Although these works have achieved progress on open-domain QA by improving the ranking or selection of given evidence, few have focused on the scenario where the model needs to start by searching for the evidence itself. Scientific Question Answering (SQA) is a representative open-domain task that requires capability in both retrieval and reading comprehension. In this paper, we study question answering on the AI2 Reasoning Challenge (ARC) scientific QA dataset (Clark et al., 2018). This dataset contains elementary-level multiple-choice scientific questions from standardized tests and a large corpus of relevant information gathered from search engines. 
The dataset is partitioned into \u201cChallenge\u201d and \u201cEasy\u201d sets.", "This paper has made two major contributions: (1) a new neural reader architecture for multi-choice QA; (2) it is the first to introduce essential term selection to open-domain QA, to the best of my knowledge. The above two proposed modules (reader and selector) are not very novel, but are still valid contributions to me. Experiments on the ARC dataset shows that (1) the proposed reader itself improves over the state-of-the-art on the leaderboard; (2) introducing the essential term selector further improves the above results by about 2%.\n\nAlthough the paper also provides additional experiments on other datasets, I feel that the contributions of the proposed methods are not sufficiently verified. I would suggest the authors consider the following further experiments that I believe could improve its ratings:\n\n(1) The proposed reader works very well on ARC. However, besides the BiDAF, there is no comparison between the proposed reader and previous models on datasets other than ARC. In order to know whether the result generalizes or not, I think the authors should conduct experiments on the regular RACE or other multi-choice QA datasets, to fully test the reader model.\n\n(2) It is not clear whether the essential term selector could help on datasets other than science questions. Again, the authors reported results on two other datasets. However, on neither of these datasets the ET-RR was compared with ET-RR (Concat). Therefore, I have concerns that the proposed framework may only be significant on the ARC dataset.\n\nMoreover, it will be interesting to see whether the essential term selector can be learned from distant supervision. For example, using REINFORCE to learn the selector model with the rewards from the end-task performance. 
The current framework heavily relies on the supervised training data from (Khashabi et al., 2017), which may limit its usage to other datasets.\n5: Marginally below acceptance threshold"], "input": "1 INTRODUCTION\nOpen-domain question answering (QA) has been extensively studied in recent years. Many existing works have followed the \u2018search-and-answer\u2019 strategy and achieved strong performance (Chen et al., 2017; Kwon et al., 2018; Wang et al., 2018) spanning multiple QA datasets such as TriviaQA (Joshi et al., 2017), SQuAD (Rajpurkar et al., 2016), MS-Macro (Nguyen et al., 2016), among others.\nHowever, open-domain QA tasks become inherently more difficult when (1) dealing with questions with little available evidence; (2) solving questions where the answer type is free-form text (e.g. multiple-choice) rather than a span among existing passages (i.e., \u2018answer span\u2019); or when (3) the need arises to understand long and complex questions and reason over multiple passages, rather than simple text matching. As a result, it is essential to incorporate commonsense knowledge or to improve retrieval capability to better capture partially related evidence (Chen et al., 2017).\nAs shown in Table 1, the TriviaQA, SQuAD, and MS-Macro datasets all provide passages within which the correct answer is guaranteed to exist. However, this assumption ignores the difficulty of retrieving question-related evidence from a large volume of open-domain resources, especially when considering complex questions which require reasoning or commonsense knowledge. On the other hand, ARC does not provide passages known to contain the correct answer. Instead, the task of identifying relevant passages is left to the solver. However, questions in ARC have multiple answer choices that provide indirect information that can help solve the question. 
As such an effective model needs to account for relations among passages, questions, and answer choices.\nFigure 1 shows an example of a question in the ARC dataset and demonstrates the difficulties in retrieval and reading comprehension. As shown for Choice 1 (C1), a simple concatenation of the question and the answer choice is not a reliable query and is of little help when trying to find supporting evidence to answer the question (e.g. we might retrieve sentences similar to the question or the answer choice, but would struggle to find evidence explaining why the answer choice is correct). On the other hand, a reformulated query consisting of essential terms in the question and\nChoice 4 can help retrieve evidence explaining why Choice 4 is a correct answer. To achieve this, the model needs to (1) ensure that the retrieved evidence supports the fact mentioned in both the question and the answer choices and (2) capture this information and predict the correct answer.\nTo address these difficulties, we propose an essential-term-aware Retriever-Reader (ET-RR) model that learns to attend on essential terms during retrieval and reading. Specifically, we develop a two-stage method with an essential term selector followed by an attention-enhanced reader.\nEssential term selector. ET-Net is a recurrent neural network that seeks to understand the question and select essential terms, i.e., key words, from the question. We frame this problem as a classification task for each word in the question. These essential terms are then concatenated with each answer choice and fed into a retrieval engine to obtain related evidence.\nAttention-Enhanced Reader. Our neural reader takes the triples (question, answer choice, retrieved passage) as input. The reader consists of a sequence of language understanding layers: an input layer, attention layer, sequence modeling layer, fusion layer, and an output layer. 
The attention and fusion layers help the model to obtain a refined representation of one text sequence based on the understanding of another, e.g. a passage representation based on an understanding of the question. We further add a choice-interaction module to handle the semantic relations and differences between answer choices. Experiments show that this can further improve the model\u2019s accuracy.\nWe evaluate our model on the ARC dataset, where our model achieves an accuracy of 36.61% on the test set, thus ranking first on the official leaderboard. We also adapt two datasets to the open-domain setting, RACE-Open and MCScript-Open, where we outperform baseline models by a large margin. Ablation studies show that each of our model\u2019s components contributes to its accuracy.\n1For SQuAD and TriviaQA, since their questions are paired with span-type answers, it is convenient to obtain ranking supervision where retrieved passages are relevant via distant supervision; however free-form questions in ARC result in a lack of supervision which makes the problem more difficult. For MS-Macro, the dataset is designed to annotate relevant passages though it has free-form answers. 3 APPROACH\nIn this section, we introduce the essential-term-aware retriever-reader model (ET-RR). As shown in Figure 2, we build a term selector to discover which terms are essential in a question. The selected terms are then used to formulate a more efficient query enabling the retriever to obtain related evidence. The retrieved evidence is then fed to the reader to predict the final answer.\nFor a question with q words Q = {wQt } q t=1 along with its N answer choices C = {Cn}Nn=1 where Cn = {wCt }ct=1, the essential-term selector chooses a subset of essential terms E \u2282 Q, which are then concatenated with each Cn to formulate a query. The query for each answer choice, E +Cn, is sent to the retriever (e.g. 
Elastic Search2), and the top K retrieved sentences based on the scores returned by the retriever are then concatenated into the evidence passage Pn = {wPt } p t=1.\nNext, given these text sequences Q, C, and P = {Pn}Nn=1, the reader will determine a matching score for each triple {Q,Cn,Pn}. The answer choice Cn\u2217 with the highest score is selected. We first introduce the reader model in Section 3.1 and then the essential term selector in Section 3.2. 3.1 READER MODEL\n 3.1.1 INPUT LAYER\nTo simplify notation, we ignore the subscript n denoting the answer choice until the final output layer. In the input layer, all text inputs\u2014the question, answer choices, and passages, i.e., retrieved evidence\u2014are converted into embedded representations. Similar to Wang (2018), we consider the following components for each word:\nWord Embedding. Pre-trained GloVe word embedding with dimensionality dw = 300.\nPart-of-Speech Embedding and Named-Entity Embedding. The part-of-speech tags and named entities for each word are mapped to embeddings with dimension 16.\nRelation Embedding. A relation between each word in P and any word in Q or C is mapped to an embedding with dimension 10. In the case that multiple relations exist, we select one uniformly at random. The relation is obtained by querying ConceptNet (Speer et al., 2017).\nFeature Embeddings. Three handcrafted features are used to enhance the word representations:\n1. Word Match. If a word or its lemma of P exists in Q or C, then this feature is 1 (0 otherwise).\n2. Word Frequency. A logarithmic term frequency is calculated for each word.\n3. Essential Term. For the i-th word in Q, this feature, denoted as wei , is 1 if the word is an essential term (0 otherwise). 
Let we = [we1 , we2 , ..., weq ] denote the essential term vector.\nFor Q,C,P, all of these components are concatenated to obtain the final word representations WQ \u2208 Rq\u00d7dQ ,WC \u2208 Rc\u00d7dC ,WP \u2208 Rp\u00d7dP , where dQ, dC , dP are the final word dimensions of Q,C, and P. 3.1.2 ATTENTION LAYER\nAs shown in Figure 2, after obtaining word-level embeddings, attention is added to enhance word representations. Given two word embedding sequences WU ,WV , word-level attention is calculated as:\nM \u2032 UV = WUU \u00b7 (WV V)>; MUV = softmax(M \u2032 UV ); W V U = MUV \u00b7 (WV V), (1)\nwhere U \u2208 RdU\u00d7dw and V \u2208 RdV \u00d7dw are two matrices that convert word embedding sequences to dimension dw, and M \u2032\nUV contains dot products between each word in WU and WV , and softmax is applied on M \u2032\nUV row-wise.\n2https://www.elastic.co/products/elasticsearch\nThree types of attention are calculated using Equation (1): (1) question-aware passage representation WQP \u2208 Rp\u00d7dw , (2) question-aware choice representation W Q C \u2208 Rc\u00d7dw , and (3) passage-aware choice representation WPC \u2208 Rc\u00d7dW . 3.1.3 SEQUENCE MODELING LAYER\nTo model the contextual dependency of each text sequence, we use BiLSTMs to process the word representations obtained from the input layer and attention layer:\nHq = BiLSTM[WQ]; H c = BiLSTM[WC ;W P C ;W Q C ]; H p = BiLSTM[WP ;W Q P ],\n(2) where Hq \u2208 Rq\u00d7l,Hc \u2208 Rc\u00d7l, and Hp \u2208 Rp\u00d7l are the hidden states of the BiLSTMs, \u2018;\u2019 is feature-wise concatenation, and l is the size of the hidden states. 
3.1.4 FUSION LAYER\nWe further convert each question and answer choice into a single vector: q \u2208 Rl and c \u2208 Rl:\n\u03b1q = softmax([H q;we] \u00b7w>sq); q = Hq >\u03b1q; \u03b1c = softmax(H c \u00b7w>sc); c = Hc >\u03b1c, (3)\nwhere the essential-term feature we from Section 3.1.1 is concatenated with Hq , and wsq and wsc are learned parameters.\nFinally, a bilinear sequence matching is calculated between Hp and q to obtain a question-aware passage representation, which is used as the final passage representation:\n\u03b1p = softmax(H p \u00b7 q); p = Hp>\u03b1p. (4) 3.1.5 CHOICE INTERACTION\nWhen a QA task provides multiple choices for selection, the relationship between the choices can provide useful information to answer the question. Therefore, we integrate a choice interaction layer to handle the semantic correlation between multiple answer choices. Given the hidden state Hcn of choice cn and Hci of other choices ci, \u2200i 6= n, we calculate the differences between the hidden states and apply max-pooling over the differences:\ncinter = Maxpool(H cn \u2212 1 N \u2212 1 \u2211 i6=n Hci), (5)\nwhere N is the total number of answer choices. Here, cinter characterizes the differences between an answer choice cn and other answer choices. The final representation of an answer choice is updated by concatenating the self-attentive answer choice vector and inter-choice representation as\ncfinal = [c; cinter ]. (6) 3.1.6 OUTPUT LAYER\nFor each tuple {q,pn, cn}Nn=1, two scores are calculated by matching (1) the passage and answer choice and (2) question and answer choice. We use the bilinear form for both matchings. 
Finally, a softmax function is applied over N answer choices to determine the best answer choice:\nspcn = pnW pccfinaln ; s qc n = qW qccfinaln ; s = softmax(s pc) + softmax(sqc), (7)\nwhere spcn , s qc n are the scores for answer choice 1 \u2264 n \u2264 N ; spc, sqc are score vectors for all N choices; and s contains the final scores for each answer choice. During training, we use a crossentropy loss. 3.2 ESSENTIAL TERM SELECTOR\nEssential terms are key words in a question that are crucial in helping a retriever obtain related evidence. Given a question Q and N answer choices C1, . . . ,CN , the goal is to predict a binary variable yi for each word Qi in the question Q, where yi = 1 if Qi is an essential term and 0\nTable 3: Precison, recall and F1 scores of different selectors.\nModel Precision Recall F1\nMaxPMI 0.88 0.65 0.75 SumPMI 0.88 0.65 0.75 PropSurf 0.68 0.64 0.66 PropLem 0.76 0.64 0.69 ET Classifier 0.91 0.71 0.80 ET-Net 0.74 0.90 0.81\notherwise. To address this problem, we build a neural model, ET-Net, which has the same design as the reader model for the input layer, attention layer, and sequence modeling layer to obtain the hidden state Hq for question Q.\nIn detail, we take the question Q and the concatenation C of all N answer choices as input to ETNet. Q and C first go through an input layer to convert to the embedded word representation, and then word-level attention is calculated to obtain a choice-aware question representation WCQ as in Equation (1). We concatenate the word representation and word-level attention representation of the question and feed it into the sequence modeling layer:\nHq = BiLSTM[WQ;W C Q]. 
(8)\nAs shown in Figure 2, the hidden states obtained from the attention layer are then concatenated with the embedded representations of Q and fed into a projection layer to obtain the prediction vector y \u2208 Rq for all words in the question:\ny = [Hq; WfQ] \u00b7 ws, (9)\nwhere ws contains the learned parameters, and WfQ is the concatenation of the POS embedding, NER embedding, relation embedding, and feature embedding from Section 3.1.1.\nAfter obtaining the prediction for each word, we use a binary cross-entropy loss to train the model. During evaluation, we take words with yi greater than 0.5 as essential terms. 4 EXPERIMENTS\nIn this section, we first discuss the performance of the essential term selector, ET-Net, on a public dataset. We then discuss the performance of the whole retriever-reader pipeline, ET-RR, on the ARC, RACE-Open and MCScript-Open datasets. For both the ET-Net and ET-RR models, we use 96-dimensional hidden states and 1-layer BiLSTMs in the sequence modeling layer. A dropout rate of 0.4 is applied for the embedding layer and the BiLSTMs\u2019 output layer. We use adamax (Kingma & Ba, 2014) with a learning rate of 0.02 and batch size of 32. The model is run for 100 epochs. 4.1 PERFORMANCE ON ESSENTIAL TERM SELECTION\nWe use the public dataset from Khashabi et al. (2017) which contains 2,223 annotated questions, each accompanied by four answer choices. Table 2 gives an example of an annotated question. As shown, the dataset is annotated for binary classification. For each word in the question, the data measures whether the word is an \u201cessential\u201d term according to 5 annotators. We then split the dataset into training, development, and test sets using an 8:1:1 ratio and select the model that performs best on the development set.\nTable 3 shows the performance of our essential term selector and baseline models from Khashabi et al. (2017). 
MAXPMI and SUMPMI score the importance of a word w by taking the max or sum of its PMI p(w, c) scores for all answer choices c. PROPSURF and PROPLEM are baselines that consider a word as an essential term if it or its lemmatized word appears at least a certain proportion of times as essential in the dataset. ET Classifier is an SVM-based model from Khashabi et al.\nTable 5: Statistics on ARC and RACE-Open. Corpus size is the number of sentences.\nDataset ARC RACEOpen\nMCScriptOpen\nTrain 1,119 9,531 1,036 Dev 299 473 156 Test 1,172 528 319 Corpus 1.46M 0.52M 24.2K\nTable 6: Ablation test on attention components of ET-RR on ARC. \u2018\u2013\u2019 denotes the ablated feature.\nModel Test\nET-RR 36.61 \u2013 inter-choice 36.36 \u2013 passage-choice 35.41 \u2013 question-choice 34.47 \u2013 passage-question 34.05\n(2017) requiring over 100 handcrafted features. As shown, our ET-Net achieves a comparable result with the ET Classifier in terms of F1 Score.\nTable 4 shows example predictions made by ET-Net. As shown, ET-Net is capable of selecting most ground-truth essential terms. It rejects certain words such as \u201corganisms\u201d which have a high TFIDF in the corpus but are not relevant to answering a particular question. This shows its ability to discover essential terms according to the context of the question. 4.2 PERFORMANCE ON OPEN-DOMAIN MULTIPLE-CHOICE QA\nWith the trained essential term selector (ET-Net) from the previous experiment, we train and evaluate the reader model on three open-domain multiple-choice QA datasets. All datasets are associated with a sentence-level corpus. In the experiments, ET-RR generates a query for each of the N answer choices. For each query, ET-RR then obtains the top K sentences returned by the retriever and considers these sentences as a passage for the reader. We set K = 10 for all experiments and report results for different K in the ablation test. 
Detailed statistics are shown in Table 5.\n\u2022 ARC (Clark et al., 2018): We consider the \u2018Challenge\u2019 set in the ARC dataset and use the provided corpus during retrieval. \u2022 RACE-Open: We adapted the RACE dataset (Lai et al., 2017) to the open-domain set-\nting. Originally, each question in RACE comes with a specific passage. To enable passage retrieval, we concatenate all passages into a corpus with sentence deduplication.3\n\u2022 MCScript-Open: The MCScript (Ostermann et al., 2018) dataset is also adapted to the open-domain setting. Again we concatenate all passages to build the corpus.4\nWe compare ET-RR against existing retrieval-reader methods on both datasets. Accuracy is shown in Table 7. Results for ARC are obtained from the official leaderboard.7 On the ARC dataset, ET-RR outperforms all previous models with a relative 8.1% improvement over the state-of-the-art BiLSTM Max-out method. On the RACE-Open and MCScript-Open datasets, ET-RR achieves a relative improvement of 24.6% and 10.5% on the test set compared with the IR solver respectively.\n3As short questions are usually passage-specific and retrieval can rarely find any related passage, we only keep questions with more than 15 words.\n4We keep questions with more than 10 words rather than 15 words to ensure that there is sufficient data. 5IR solver sends question plus each answer choice as query to the search engine, then pick the answer choice of which the top retrieved sentence has the highest score as the answer 6Different from ET-RR, in the original BiDAF baseline, the sentences returned by each query are mixed together, then the top N \u00d7K sentences are aggregated as a whole passage and passed to the reader. 7Snapshot from http://data.allenai.org/arc/ on September 26, 2018 4.3 ABLATION STUDY\nFinally, we investigate how each component contributes to model performance.\nAttention components. Table 6 demonstrates how the attention components contribute to the performance of ET-RR. 
As shown, ET-RR with all attention components performs the best on the ARC test set. Removing passage-question attention causes the largest drop in performance among all the components. It is worth noting that the choice interaction layer gives a further 0.24% boost on test accuracy.\nEssential term selection. To understand the contribution of our essential-term selector, we introduce two variants of ET-RR:\n\u2022 ET-RR (Concat). Concatenates the original question and answer choice as the query.\n\u2022 ET-RR (TF-IDF). We calculate the TF-IDF scores and take the top 30% of words8 with the highest scores in the question to concatenate with each answer choice as a query.\nTable 8 shows an ablation study comparing different query formulation methods and amounts of retrieved evidence K. As shown, with the essential term selector ET-Net, the model consistently outperforms other baselines, given different numbers of retrievals K. All models perform best when K = 10. Furthermore, only using TF-IDF to select essential terms in a question is not effective. When K = 10, the ET-RR (TF-IDF) method even performs worse than ET-RR (Concat). This illustrates the challenges in understanding what is essential in a question. 5 CONCLUSION\nWe presented a new retriever-reader model (ET-RR) for open-domain QA. Our pipeline has the following contributions: (1) we built an essential term selector (ET-Net) which helps the model understand which words are essential in a question, leading to more effective search queries when retrieving related evidence; (2) we developed an attention-enhanced reader with attention and fusion among passages, questions, and candidate answers. Experimental results show that ET-RR outperforms existing QA models on the ARC, RACE-Open and MCScript-Open datasets.\n8According to the annotated dataset, around 30% of the terms in each question are labelled as essential.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Multi-scale dense networks for resource efficient image classification , Authors: Gao Huang Danlu Chen... , 2017\n2.  Title: Densely connected convolutional networks , Authors: Gao Huang Zhuang Liu... , 2017\n3.  Title: Condensenet: An efficient densenet using learned group , Authors: Gao Huang Shichen Liu... , 2018\n4.  Title: Blockdrop: Dynamic inference paths in residual networks. 2018 , Authors: Zuxuan Wu Tushar Nagarajan... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Deep neural networks have achieved outstanding performance in many real-world applications with the expense of huge computational resources. The DenseNet, one of the recently proposed neural network architecture, has achieved the state-of-the-art performance in many visual tasks. However, it has great redundancy due to the dense connections of the internal structure, which leads to high computational costs in training such dense networks. To address this issue, we design a reinforcement learning framework to search for efficient DenseNet architectures with layer-wise pruning (LWP) for different tasks, while retaining the original advantages of DenseNet, such as feature reuse, short paths, etc. In this framework, an agent evaluates the importance of each connection between any two block layers, and prunes the redundant connections. In addition, a novel reward-shaping trick is introduced to make DenseNet reach a better trade-off between accuracy and float point operations (FLOPs). Our experiments show that DenseNet with LWP is more compact and efficient than existing alternatives.", "Huang et al. 
(2018) proposed group convolution to remove connections between layers in DenseNet for which this feature reuse is superfluous; Huang et al. (2017a) also suggested progressively update prediction for every test sample to unevenly adapt the amount of computational resource at inference time. The most related work is BlockDrop (Wu et al., 2018), which used reinforcement learning to prune weight dynamically at inference time but can only be applied to ResNet or its variants. In contrast, our approach is based on DenseNet, aiming to find efficient network structure based the densely connected features of DenseNet.", "The paper introduces RL based approach to prune layers in a DenseNet. This work extends BlockDrop to DenseNet architecture making the controller independent form the input image. The approach is evaluated on CIFAR10 and CIFAR100 datasets as well as on ImageNet showing promising results.\n\nIn order to improve the paper, the authors could take into consideration the following points:\n1. Given the similarity of the approach with BlockDrop, I would suggest to discuss it in the introduction section clearly stating the similarities and the differences with the proposed approach. \n2. BlockDrop seems to introduce a general framework of policy network to prune neural networks. However, the authors claim that BlockDrop \"can only be applied to ResNets or its variants\". Could the authors comment on this? \n3. In the abstract, the authors claim: \"Our experiments show that DenseNet with LWP is more compact and efficient than existing alternatives\". It is hard to asses if the statement is correct given the evidence presented in the experimental section. It is not clear if the method is more efficient and compact than others, e. g.  CondenseNet. \n4. In the experimental section, addressing the following questions would make the section stronger: What is more important FLOPs or number of parameters? 
What is the accuracy drop we should allow to pay for reduction in number of parameters or FLOPs?\n5. For the evaluation, I would suggest to show that the learned policy is better than a random one: e. g. not using the controller to define policy (in line 20 of the algorithm) and using a random random policy instead.\n6. In Table 1, some entries for DenseNet LWP are missing. Is the network converging for this setups? \n7. \\sigma is not explained in section 3.3. What is the intuition behind this hyper parameter?\n\n\nAdditional comments:\n1. It might be interesting to compare the method introduced in the paper to a scenario where the controller is conditioned on an input image and adaptively selects the connections/layers in DenseNet at inference time.\n2. It might be interesting to report the number of connections in Table 1 for all the models.\n\n4: Ok but not good enough - rejection"], "input": "1 INTRODUCTION\nDeep neural networks are increasingly used on mobile devices, where computational resources are quite limited(Chollet, 2017; Sandler et al., 2018; Zhang et al., 2017; Ma et al., 2018). Despite the success of deep neural networks, it is very difficult to make efficient or even real-time inference on low-end devices, due to the intensive computational costs of deep neural networks. Thus, the deep learning community has paid much attention to compressing and accelerating different types of deep neural networks(Gray et al., 2017).\nAmong recently proposed neural network architectures, DenseNet (Huang et al., 2017b) is one of the most dazzling structures which introduces direct connections between any two layers with the same featuremap size. It can scale naturally to hundreds of layers, while exhibiting no optimization difficulties. In addition, it achieved state-of-the-art results across several highly competitive datasets. 
However, recent extensions of Densenet with careful expert design, such as Multi-scale DenseNet(Huang et al., 2017a) and CondenseNet(Huang et al., 2018), have shown that there exists high redundancy in DenseNet. Our paper mainly focuses on how to compress and accelerate the DenseNet with less expert knowledge on network design.\nA number of approaches have been proposed to compress deep networks. Generally, most approaches can be classified into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation(Gray et al., 2017). Unlike these approaches requiring intensive expert experience, automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Liu et al., 2018; Zoph & Le, 2016; Pham et al., 2018; Liu et al., 2017a; Brock et al., 2017).\nHowever, none of these NAS methods are efficient for DenseNet due to the dense connectivity between layers. It is thus interesting and important to develop an adaptive strategy for searching an on-demand neural network structure for DenseNet such that it can satisfy both computational budget and inference accuracy requirement.\nTo this end, we propose a layer-wise pruning method for DenseNet based on reinforcement learning. Our scheme is that an agent learns to prune as many as possible weights and connections while maintaining good accuracy on validation dataset. As illustrated in Figure 1, our agent learns to output a sequence of actions and receives reward according to the generated network structure on validation datasets. 
Additionally, our agent automatically generates a curriculum of exploration, enabling effective pruning of neural networks.\nExtensive experiments on several highly competitive datasets show that our method largely reduces the number of parameters as well as flops, while maintaining or slightly degrading the prediction performance, such that the corresponding network architecture can adaptively achieve a balance between inference accuracy and computational resources. 2 BACKGROUND\nWe first introduce reinforcement learning and policy gradient in Section 2.1, and DenseNet in Section 2.2, and finally neural architecture search in Section 2.3. 2.1 REINFORCEMENT LEARNING AND POLICY GRADIENT\nReinforcement learning considers the problem of finding an optimal policy for an agent that interacts with an uncertain environment and collects reward per action(Sutton et al., 1998). The goal of the agent is to maximize the long-term cumulative reward. Formally, this problem can be formulated as a Markov decision process over the environment states s \u2208 S and agent actions a \u2208 A, under an unknown environmental dynamic defined by a transition probability T (s\u2032|s, a) and a reward signal r(s, a) immediately following the action a performed at state s. The agent\u2019s action a is selected by a conditional probability distribution \u03c0(a|s) called policy or actor. In policy gradient methods, we consider a set of candidate policies \u03c0\u03b8(a|s) parameterized by \u03b8 and obtain the optimal policy by maximizing the expected cumulative reward or return\nJ(\u03b8) = Es\u223c\u03c1\u03c0,a\u223c\u03c0(a|s) [r(s, a)] , (1)\nwhere \u03c1\u03c0(s) = \u2211\u221e t=1 \u03b3\nt\u22121Pr(st = s) is the normalized discounted state visit distribution with a discount factor \u03b3 \u2208 [0, 1). 
To simplify the notation, we denote Es\u223c\u03c1\u03c0,a\u223c\u03c0(a|s)[\u00b7] by simply E\u03c0[\u00b7] in the rest of paper.\nAccording to the policy gradient theorem(Sutton et al., 1998), the gradient of J(\u03b8) can be written as\n\u2207\u03b8J(\u03b8) = E\u03c0 [\u2207\u03b8 log \u03c0(a|s)Q\u03c0(s, a)] , (2) where Q\u03c0(s, a) = E\u03c0 [\u2211\u221e t=1 \u03b3 t\u22121r(st, at)|s1 = s, a1 = a ]\ndenotes the expected return under policy \u03c0 after taking an action a in state s, which is also called critic.\nSince the expectation in Eq (3) is over action, it is helpful to estimate a value function V (s) and subtract it from Q(s, a) to reduce variance while keeping unbiased.\n\u2207\u03b8J(\u03b8) = E\u03c0 [\u2207\u03b8 log \u03c0(a|s) (Q\u03c0(s, a)\u2212 V (s))] . (3)\nThe most straightforward way is to simulate the environment with the current policy \u03c0 to obtain a trajectory {(st, at, rt)}nt=1 and estimate \u2207\u03b8J(\u03b8) using the Monte Carlo estimation:\n\u2207\u0302\u03b8J(\u03b8) = 1\nn n\u2211 t=1 \u03b3t\u22121\u2207\u03b8 log \u03c0(at|st) ( Q\u0302\u03c0(st, at)\u2212 V\u0302 (st) ) , (4)\nwhere Q\u0302\u03c0(st, at) is an empirical estimate of Q\u03c0(st, at), e.g., Q\u0302\u03c0(st, at) = \u2211 j\u2265t \u03b3\nj\u2212trj , and V\u0302 (st) is an empirical estimate of V (s). 2.2 DENSENET\nDensely connected networks(Huang et al., 2017b) consist of multiple dense blocks, each of which also consists of multiple layers. Each layer produces k features maps, where k is referred to the growth rate of the network. 
The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block.\nInside every dense block, the first transformation is a composition of batch normalization (BN) and rectified linear units (ReLU), followed by the first convolutional layer in the sequence, which reduces the number of channels to save computational cost by using 1 \u00d7 1 filters. The output is then followed by another BN-ReLU transformation and is then reduced to the final k output features through a 3 \u00d7 3 convolution. 2.3 NEURAL ARCHITECTURE SEARCH\nNeural Architecture Search (NAS) is a method for automated design of neural network structures, with the aid of either evolutionary algorithms (Xie & Yuille, 2017; Real et al., 2017) or reinforcement learning (Baker et al., 2016; Cai et al., 2018; Zhong et al., 2017; Zoph & Le, 2016; Zoph et al., 2017). When using reinforcement learning, the agent performs a sequence of actions, which specifies a network structure; this network is then trained and its corresponding validation performance is returned as the reward to update the agent. 3 METHOD\nWe analyze the dense connections of DenseNet in Section 3.1, then we model the layer-wise pruning as a Markov decision process (MDP) (Puterman, 2014) and design a long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) controller to generate inference paths in Section 3.2. The interaction between the agent (i.e., the LSTM controller) and the environment (i.e., the DenseNet) is described in Figure 2. The reward shaping technique in our method is introduced in Section 3.3. Finally, we show the complete training process of LWP in Section 3.4. 3.1 PRETRAINED DENSE CONVOLUTIONAL NETWORKS\nVanilla DenseNet consists of four parts: the first convolution layer, multiple dense blocks, transition layers and finally the fully-connected layer. 
The first convolution layer only extracts features from the raw data. Each dense block consists of multiple layers. The transition layers act as down-sampling layers that change the size of the feature maps, and the last fully-connected layer is used for image classification. The dense connections appear mainly in the dense blocks. Therefore, we study the connection policy for dense layers in this paper. 3.2 GENERATE INFERENCE PATHS WITH AN LSTM CONTROLLER\nSuppose the DenseNet has L layers; the controller needs to make K decisions, where K equals the number of layers in the dense blocks. For layer i, we specify the number of previous layers to be connected in the range between 0 and ni (ni = i). All possible connections in the DenseNet constitute the action space of the agent. However, the time complexity of traversing the action space is O(\u220f_{i=1}^{K} 2^{ni}), which is exponential in the number of layers and intractable for DenseNet (Huang et al., 2017b). Fortunately, reinforcement learning is good at solving sequential decision optimization problems, so we model the network pruning as a Markov Decision Process (MDP). Since these hierarchical connections have time-series dependencies, it is natural to train an LSTM as the controller to address this issue.\nAt the first time step, the LSTM controller receives an empty embedding vector as input, which is regarded as the fixed state s of the agent; the output of each time step is the input for the next time step. Each output neuron in the LSTM is equipped with the sigmoid function \u03b4(x) = 1/(1 + e^{\u2212x}), so that the output oi defines a policy pi,ai of keeping or dropping connections between the current layer and its previous layers as an ni-dimensional Bernoulli distribution:\noi = f(s; \u03b8c), pi,ai = \u220f_{j=1}^{ni} oij^{aij} (1 \u2212 oij)^{(1 \u2212 aij)}, (5)\nwhere f denotes the controller parameterized with \u03b8c. 
The j-th entry of the output vector oi, denoted by oij \u2208 [0, 1], represents the probability that the corresponding connection between the i-th layer and the j-th layer is kept. The action ai \u2208 {0, 1}^{ni} is sampled from Bernoulli(oi); aij = 1 means keeping the connection, and aij = 0 means dropping it. There are ni connections in total for the i-th layer, but the output dimension of the LSTM at each time step is K. To unify the action space dimension and the LSTM output dimension, we set both to K and apply a mask \u2208 {0, 1}^{K} to the output of each time step, where the first ni elements of the mask are 1 and the others are 0. Finally, the probability distribution of the whole neural network architecture is:\n\u03c0(a1:K | s; \u03b8c) = \u220f_{i=1}^{K} pi,ai (6) 3.3 REWARD SHAPING\nReward shaping is introduced to help the controller make progress toward an optimal solution. The reward function is defined per sample; it considers not only whether the prediction is correct but also encourages less computation:\nR(a) = 1 \u2212 \u03b7^\u03b1 if the prediction is correct, and R(a) = \u2212\u03b3 otherwise, (7)\nwhere \u03b7 = SUBFLOPs / FLOPs measures the fraction of floating-point operations utilized. SUBFLOPs and FLOPs denote the floating-point operations of the child network and the vanilla DenseNet, respectively. To maximize the reward, the prediction needs to be correct and SUBFLOPs should be reduced as much as possible. The trade-off between performance and complexity is mainly controlled by \u03b1 and \u03b3; more details are discussed in Section 7.3 of the appendix. 
3.4 TRAINING WITH ADVANTAGE ACTOR-CRITIC\nAfter obtaining the feedback from the child network, we modify Eq (1) into the following expected reward:\nJ(\u03b8c) = E_{a\u223c\u03c0\u03b8c} [r(s, a)] (8)\nTo maximize Eq (8) and accelerate policy gradient training over \u03b8c, we use advantage actor-critic (A2C) with an estimate of the state value function V(s; \u03b8v) to derive the gradient of J(\u03b8c) as:\n\u2207\u03b8cJ(\u03b8c) = \u2211_a (r(s, a) \u2212 V(s; \u03b8v)) \u03c0(a|s, \u03b8c) \u2207\u03b8c log \u03c0(a|s, \u03b8c) (9)\nEq (9) can be approximated by Monte Carlo sampling:\n\u2207\u03b8cJ(\u03b8c) = (1/n) \u2211_{t=1}^{n} (r^{(t)}(s, a) \u2212 V(s; \u03b8v)) \u2207\u03b8c log \u03c0(a|s, \u03b8c) (10)\nwhere n is the batch size. The mini-batch samples share the same child network and perform forward propagation in parallel. Therefore, they have the same policy distribution \u03c0(a|s, \u03b8c) but different rewards r(s, a). We further improve exploration, to prevent the policy from converging to a suboptimal deterministic policy, by adding the entropy of the policy, H(\u03c0(a|s, \u03b8c)), to the objective function. The gradient of the full objective function takes the form:\n\u2207\u03b8cJ(\u03b8c) = (1/n) \u2211_{t=1}^{n} [(r^{(t)}(s, a) \u2212 V(s; \u03b8v)) \u2207\u03b8c log \u03c0(a|s, \u03b8c) + \u03b2 \u2207\u03b8c H(\u03c0(a|s, \u03b8c))] (11)\nFor the value network, we define the loss function Lv and use gradient descent to update \u03b8v:\nLv = (1/n) \u2211_{t=1}^{n} (r^{(t)}(s, a) \u2212 V(s; \u03b8v))^2, \u2207\u03b8vLv = \u2212(2/n) \u2211_{t=1}^{n} (r^{(t)}(s, a) \u2212 V(s; \u03b8v)) \u2202V(s; \u03b8v)/\u2202\u03b8v (12)\nThe entire training procedure is divided into three stages: curriculum learning, joint training and training from scratch. Algorithm 1 shows the complete recipe for layer-wise pruning.\nCurriculum learning. 
It is easy to note that the search space scales exponentially with the block layers of DenseNet and there are total \u220fK i=1 2\nni keeping/dropping configurations. We use curriculum learning(Bengio, 2013) like BlockDrop(Wu et al., 2018) to solve the problem that policy gradient is sensitive to initialization. For epoch t (1 \u2264 t < K), the LSTM controller only learns the policy of the last t layers and keeps the policy of the remaining K \u2212 t layers consistent with the vanilla DenseNet. As t \u2265 K, all block layers are involved in the decision making process.\nJoint training. The previous stage just updates parameters \u03b8c and \u03b8v. The controller learns to identify connections between two block layers to be kept or dropped. However, it prevents the agent from learning the optimal architecture. Jointly training the DenseNet and controller can be employed as the next stage so that the controller guides the gradients of \u03b8v to the direction of dropping more connections.\nTraining from scratch. After joint training, several child networks can be sampled from the policy distribution \u03c0(a|s, \u03b8c) and we select the child network with the highest reward to train from scratch, and thus better experimental results have been produced.\nWe summarize the entire process in Algorithm 1. 5 EXPERIMENT\nWe evaluate the LWP method on three benchmarks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009) and ImageNet 2012 (Deng et al., 2009) and these three datasets are used for image classification. Details of experiments and hyperparameters setting in Appendix 7.3. 5.1 RESULTS ON CIFAR\nPretrained DenseNet. For CIFAR datasets, DenseNet-40-12 and DenseNet-100-12 are selected as the backbone CNN. During the training time, the backbone CNN needs to make predictions with dynamic computation paths. 
To adapt the backbone CNN to our algorithm, we reproduced DenseNet-40-12 and DenseNet-100-12 on CIFAR in PyTorch (Paszke et al., 2017); the results are shown in Table 1.\nComparisons and analysis. The results on CIFAR are reported in Table 1. For the CIFAR-10 dataset and the vanilla DenseNet-40-12, our method reduces FLOPs and parameters by nearly 81.4% and 78.2%, respectively, while the test error increases by only 1.58%. The exponential power \u03b1 and penalty \u03b3 can be tuned to improve the performance. In this experiment, we only change the hyperparameter \u03b1 from 2 to 3, so that the model complexity (105M vs 173M FLOPs) is increased while the test error rate is reduced to 6.00%. The same trend can be observed on DenseNet-100-12 with LWP. Our algorithm also compares favorably with CondenseNet (Huang et al., 2018), which needs more expert knowledge, and with NAS (Zoph & Le, 2016), which requires much more search time and more parameters but reaches a higher test error.\nWe can also observe from Table 1 that, on CIFAR-100, the FLOPs of DenseNet with LWP are nearly 46.5% and 66.3% of those of DenseNet-40-12 and DenseNet-100-12, respectively. The compression rates are worse than those for CIFAR-10. This may be caused by the complexity of the CIFAR-100 classification task: the harder the task, the more computation is needed. 5.2 RESULTS ON IMAGENET\nPretrained DenseNet. We compress DenseNet-121-32, which has four dense blocks ([6, 12, 24, 16]), on ImageNet. The growth rate of DenseNet-121-32 is 32, and the architecture is equipped with bottleneck layers and a compression ratio fixed at 0.5, both designed to improve model compactness. In the following, we show that the model can be further compressed. This model is initialized by loading the pretrained checkpoint from PyTorch.\nComparisons and analysis. 
Although bottleneck layers and a compression ratio are used in DenseNet-121-32, the results show that there is still much redundancy. As observed from Table 2, we can still remove 54.7% of the FLOPs and 35.2% of the parameters of the vanilla DenseNet-121-32, with increases of 1.84% in top-1 and 1.28% in top-5 test error. 5.3 QUANTITATIVE RESULTS\nIn this section, we argue that our proposed method can learn a more compact neural network architecture by analyzing the number of input channels in each DenseNet layer and the connection dependency between a convolution layer and its preceding layers.\nIn the left panel of Figure 3, the red bars represent the number of input channels in DenseNet-40-12-LWP (D40-12-LWP) and the blue bars represent the number of input channels in the vanilla DenseNet. We can observe that the number of input channels in the vanilla DenseNet grows linearly with the layer index because of the concatenation operation, whereas D40-12-LWP has layer-wise input channels identified automatically by the controller. An input channel count of 0 means that the layer is dropped, so the number of block layers is reduced from 36 to 26. The number of connections between a layer and its preceding layers can be read from the right panel of Figure 3, where the x and y axes define the target layer t and the source layer s. The small square at position (s, t) represents the connection dependency of target layer t on source layer s. The pixel value at position (s, t) is the average absolute filter weight of the convolution layers in D40-12-LWP. One small square means one connection, and the number of small squares in the vertical direction indicates the number of connections to target layer t.\nAs reported in the DenseNet paper (Huang et al., 2017b), there are redundant connections because of the low average kernel weights between some layers. 
The right panel of Figure 3 shows that the values of the small squares connecting to the same target layer t are almost equal, which means that layer t depends almost equally on its different preceding layers. This indicates that the child network learned from the vanilla DenseNet is quite compact and efficient. 6 CONCLUSION\nWe propose an algorithm to search for an efficient child network of DenseNet with a reinforcement learning agent. An LSTM is used as the controller to prune the redundant connections layer by layer. The whole process is divided into three stages: curriculum learning, joint training and training from scratch. Extensive experiments on CIFAR and ImageNet show the effectiveness of our method. Analyzing the child network and the filter parameters in every convolution layer shows that our proposed method can learn to search for a compact and efficient neural network architecture. 7 APPENDIX\n 7.1 DATASETS AND EVALUATION METRICS\nCIFAR-10 and CIFAR-100 consist of images from 10 and 100 classes, respectively, with 32\u00d732 RGB pixels. Both datasets contain 60,000 images, of which 50,000 are used for training and 10,000 for testing. We use standard data pre-processing and augmentation techniques. The complete procedure is: normalize the data using the channel means and standard deviations, pad the training images by 4 pixels, randomly crop to restore 32\u00d732 images and randomly flip horizontally with probability 0.5. The evaluation metric on CIFAR is the prediction accuracy.\nImageNet contains 1.33 million color images in total with 1000 visual classes, 1.28 million for training and 50k for validation. We also adopt a data-augmentation scheme for pre-processing, i.e., resize the images to 256\u00d7256, normalize the images using channel means and standard deviations, randomly crop to 224\u00d7224 and flip horizontally at training time, but apply a center crop of size 224\u00d7224 at test time. 
The performance on ImageNet is measured by both top-1 and top-5 prediction accuracy. 7.2 TRAINING CONFIGURATIONS\nTraining configurations for CIFAR. Based on the pretrained DenseNet, the LSTM controller is trained with batch size 128 for 1000 epochs during the curriculum learning procedure, and the ADAM optimizer without weight decay is adopted. The learning rate starts from 10^{\u22123} and is divided by 10 at epochs 500 and 750. For the joint training, we fix the learning rate at 10^{\u22124} and finetune the model for 1000 epochs. Then we select the child network with the highest reward to train from scratch. In the last stage, the SGD optimizer with a weight decay of 10^{\u22124} and a Nesterov momentum of 0.9 without dampening is adopted. We train the selected child network with mini-batch size 64 and a cosine-shaped learning rate from 0.1 to 0 for 300 epochs.\nTraining configurations for ImageNet. For curriculum learning and joint training, we set the number of epochs to 90 and 50, respectively, and the batch size to 1024. In the curriculum learning procedure, the learning rate is set to 1e\u22123 and is divided by 10 at epochs 45 and 75. The learning rate is fixed at 1e\u22124 for the joint training procedure. We use the same optimizer parameters as in the CIFAR experiments. Finally, the learned child network is optimized like DenseNet (Huang et al., 2017b). 7.3 HYPERPARAMETERS SEARCH\nWe use the reward shaping technique in our model, and the detailed reward formulation is defined in Eq (7). The trade-off between model performance and complexity can be controlled by adjusting the reward function. As shown in Eq (7), the reward function mainly depends on the exponential power \u03b1 of the FLOPs ratio and the penalty \u2212\u03b3. In the following, we analyze these two factors for the child network (D40-12-LWP) based on DenseNet-40-12 (D40-12) and the CIFAR-10 dataset.\nExponential power. 
Given a policy, we pass an image to the child network and expect a higher reward if the prediction is correct. For a fixed \u03b1, the lower the FLOPs of the child network, the larger the reward. Conversely, for the same reward value, a larger exponential power \u03b1 permits a larger model complexity. As shown in Figure 4 (a) (b), when setting the exponential power \u03b1 = 1/3, 1/2, 1, 2, 3 and fixing \u03b3 = 0.5, the classification accuracy and FLOPs of the child network both increase gradually.\nPenalty. For an incorrect prediction, the penalty \u2212\u03b3 is given as the feedback. A bigger \u03b3 means the controller puts more emphasis on model performance than on reducing model complexity. \u03b3 is set to 0.25, 0.5, 0.75 and 1, with the exponential power \u03b1 fixed at 2. The results are shown in Figure 4 (c) (d), and we can observe that both curves first increase and then decrease. 7.4 ALGORITHM\nAlgorithm 1 The pseudo-code for layer-wise pruning. Input: Training dataset Dt; Validation dataset Dv; Pretrained DenseNet.\nInitialize the parameters \u03b8c of the LSTM controller and \u03b8v of the value network randomly. 
Set epochs for curriculum learning, joint training and training from scratch to Mcl, Mjt and Mfs respectively and sample Z child networks.\nOutput: The optimal child network\n1: //Curriculum learning\n2: for t = 1 to Mcl do\n3:   o = f(s; \u03b8c)\n4:   if t < K then\n5:     for i = 1 to K \u2212 t do\n6:       o[i, 0 : i] = 1\n7:       o[i, i :] = 0\n8:     end for\n9:   end if\n10:   Sample a from Bernoulli(o)\n11:   DenseNet with policy makes predictions on the training dataset Dt\n12:   Calculate feedback R(a) with Eq (7)\n13:   Update parameters \u03b8c and \u03b8v with Eq (11) and Eq (12) respectively\n14: end for\n15: //Joint training\n16: for t = 1 to Mjt do\n17:   Simultaneously train DenseNet and the controller\n18: end for\n19: for t = 1 to Z do\n20:   Sample a child network from \u03c0(a|s, \u03b8c)\n21:   Execute the child network on the validation dataset Dv\n22:   Obtain feedback R(t)(a) with Eq (7)\n23: end for\n24: Select the child network N with highest reward\n25: //Training from scratch\n26: for t = 1 to Mfs do\n27:   Train the child network N from scratch\n28: end for\n29: return The optimal child network N", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Qualitatively characterizing neural network optimization problems , Authors: Ian J Goodfellow Oriol Vinyals... , 2014\n2.  Title: Train longer, generalize better: closing the generalization gap in large batch training of neural networks , Authors: Elad Hoffer Itay Hubara... , 2017\n3.  Title: On large-batch training for deep learning: Generalization gap and sharp minima , Authors: Nitish Shirish Keskar Dheevatsa Mudigere... , 2016\n4.  Title: Batch size matters: A diffusion approximation framework on nonconvex stochastic gradient descent , Authors: Chris Junchi Li Lei Li... , 2017\n5.  Title: Visualizing the loss landscape of neural nets , Authors: Hao Li Zheng Xu... , 2017\n6.  Title: Stochastic gradient descent as approximate bayesian inference , Authors: Stephan Mandt Matthew D Hoffman... , 2017\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["The non-convex nature of the loss landscape of deep neural networks (DNN) lends them the intuition that over the course of training, stochastic optimization algorithms explore different regions of the loss surface by entering and escaping many local minima due to the noise induced by mini-batches. But is this really the case? This question couples the geometry of the DNN loss landscape with how stochastic optimization algorithms like SGD interact with it during training. Answering this question may help us qualitatively understand the dynamics of deep neural network optimization. We show evidence through qualitative and quantitative experiments that mini-batch SGD rarely crosses barriers during DNN optimization. As we show, the mini-batch induced noise helps SGD explore different regions of the loss surface using a seemingly different mechanism. 
To complement this finding, we also investigate the qualitative reason behind the slowing down of this exploration when using larger batch-sizes. We show this happens because gradients from larger batch-sizes align more with the top eigenvectors of the Hessian, which makes SGD oscillate in the proximity of the parameter initialization, thus preventing exploration.", "There have been previous work on visualizing the loss surface although from different motivations. Perhaps Goodfellow et al. (2014) is most similar to our work since we use the loss interpolation tool suggested in their paper to perform our analysis. They perform interpolation between the initial and final parameters and based on their finding, draw the conclusion that the loss along the line connecting these two points does not have any barriers. We note that we use their tool for a different purpose and our conclusions are fundamentally different from theirs because we use the observations to investigate whether SGD crosses barriers during optimization over deep networks\u2019 loss landscape. Li et al. (2017b) also visualize the loss landscape of different network architectures. Our work is closely related to a number of recent papers that study SGD as a diffusion process because we present a complementary qualitative view to an aspect of their theory. Hoffer et al. (2017) hypothesize this view based on the evidence that the parameter distance moved by SGD from initialization as a function of the number of iterations resembles a diffusion process. Li et al. (2017a) hypothesize this behavior of SGD and theoretically show that this diffusion process would allow SGD to escape sharp local minima. The authors use this theoretical result to support the findings of Keskar et al. (2016) who find that SGD with small mini-batch size find wider minima. Kushner & Yin (2003); Mandt et al.", "The subject of how a given algorithm explores the landscape is still a poorly understood area in training neural networks. 
There is a large body of recent work that attempts to shed light on this puzzle, and each one tries to claim their share in the furthering of the understanding of the relationship between the geometry of the landscape and the dynamics that one chooses in optimization. The present paper is a fine addition to the literature with interesting observations and novel questions, however, it falls short in many core areas: An apparent work in progress that has a great potential. \n\nA major question (that I hope will easily be addressed) is on the definition of the barrier itself. According to the text, a barrier is defined judging by the minima of two 1-dimensional segments that connect weights connecting three consecutive steps: if the minimum of the line segment defined by the latter step is larger than the former, then it declared that a barrier is crossed. In a low dimensional world, this makes total sense, however, I fail to understand what kind of barrier it implies on the geometry of the landscape: Can the 1-dimensional lines be on the sides of a valley? Can one find *another* 1-dimensional projection for which the inequality is broken? How do such dependencies change the understanding of the problem? And if one is indeed only interested in the flat line segments (since SGD is making discrete steps), then one can, in principle, observe barrier crossing in a convex problem, as well? Is there an argument for otherwise? Or if it is a notion that applies equally well in a convex case then how should we really think about the barrier crossing? On the opposite point of view, can one not imagine a barrier crossing that doesn't appear in this triangular inequality above?\n\nThe paper is full of empirical evidence that is guided by a simple observable that is very intuitive, however, it lacks a comprehensive discussion on the new quantity they propose that I consider a major flaw, but that I think (hope) that the authors can fix very easily. 
Some minor points that would improve the readability and clarity for the reader:\n- The figures are not very reader-friendly, this can be improved by better using the whitespaces in the paper but it can also be improved by finding further observables that would summarize the observations instead of showing individual consecutive line interpolations.\n- What are the values of the y-axis in Figure 5 and 6? Are they the top eigenvalues of the Hessian?\n- In the models that are compared in Figure 7, what are their generalization properties (early stopping and otherwise)?\n- The interpretation at the end of p. 6 may be a good motivation for the reader if it had been introduced earlier for that section.\n\n\n4: Ok but not good enough - rejection"], "input": "1 INTRODUCTION\nThe non-convexity of the deep neural network (DNN) loss surface makes the behavior of optimization algorithms less intuitive compared to the convex setting. Moreover, optimization in DNNs is no longer about finding any minimum, but rather about finding ones that generalizes well (Keskar et al., 2016). Since deep networks are initialized randomly, finding such minima will require exploration of different regions of the loss surface. This intuition has been formalized in recent papers that study stochastic gradient descent (SGD) as a diffusion process (Hoffer et al., 2017; Smith & Le, 2017; Jastrzebski et al., 2017; Chaudhari & Soatto, 2017). Briefly, these papers show that SGD simulates a discrete approximation of stochastic differential equation (SDE), and hence performs a random walk on the potential induced by the DNN loss surface.\nIn this work, we complement the diffusion perspective of SGD with a qualitative view of how SGD explores different regions of the non-convex loss landscape of deep neural networks through empirical evidence. Intuitively, when performing random walk on a potential, one would expect barriers being crossed quite often during the process. 
We show in this work that SGD rarely crosses any barriers along its path during the course of training. By this observation, we do not claim that SGD does not simulate diffusion. Through experimental deductions, we show an alternate mechanism that SGD seems to dominantly use to explore different regions of the non-convex loss landscape.\nFurther, it is known that larger batch-sizes slow down the diffusion process (Hoffer et al., 2017). We show the qualitative reason behind this slow down to be an oscillation behavior of SGD which prevents it from moving far away from initialization. This behavior is a result of the mini-batch gradients becoming increasingly aligned with the top eigenvectors of the Hessian for larger batch-sizes. This behavior is known to slow down convergence in optimization theory (for instance consider the motivation behind momentum (Polyak, 1964; Sutskever et al., 2013)). We discuss how it also slows down explorations in the non-convex setting of deep network loss surface.\nExperiments are conducted on multiple data sets, architectures and hyper-parameter settings. The findings mentioned above hold true on all of them. 2 SETUP\nWe now describe the details of how we study the existence of barriers along the optimization path of SGD. The main tool we use for studying the DNN loss surface along SGD\u2019s path is to interpolate the loss surface between parameters before and after each training update. We note that this strategy of interpolating the loss surface between parameters was introduced by Goodfellow et al. (2014). In their paper, the interpolation is conducted between initial and final (after training) parameter values for analysis purposes. In contrast, we compute interpolations before and after each training update because this interpolation precisely tells us whether or not SGD crosses a barrier during an update step. 
We say a barrier is crossed when we see a point in the parameter space interpolated between the parameters just before and after an update step, such that the loss at the barrier point is higher than the loss at both the other points.\nConsider that the parameters \u03b8 of a neural network are initialized to a value \u03b80. When using an optimization method to update these parameters, the tth update step takes the parameter from \u03b8t to \u03b8t+1 using estimated gradient gt as,\n\u03b8t+1 = \u03b8t \u2212 \u03b7gt (1)\nwhere \u03b7 is the learning rate. Notice the tth update step implies the tth epoch only in the case when using the full batch gradient descent (GD). In the case of stochastic gradient descent, one iteration is an update from gradient computed from a mini-batch. We then interpolate the DNN loss between the convex combination of \u03b8t and \u03b8t+1 by considering parameter vectors \u03b8\u03b1t = (1\u2212 \u03b1)\u03b8t + \u03b1\u03b8t+1, where \u03b1 \u2208 [0, 1] is chosen such that we obtain 10 samples uniformly placed between these two parameter points. We note that even though the updates are performed using mini-batches for SGD, the training loss values we compute for the interpolation use the full dataset to visualize the actual loss landscape. 3 BARRIERS AND EXPLORATION DURING SGD TRAINING\nFor this section, we perform experiments on MNIST (Lecun & Cortes) and CIFAR-10 (Krizhevsky, 2009) datasets, and use multi-layer perceptrons (MLP), VGG-11 (Simonyan & Zisserman, 2014) and Resnet-56 (He et al., 2016) architectures with various batch sizes and learning rates. We discuss our observations for VGG-11 architecture on CIFAR-10 dataset (figure 1) as a reference but the same conclusions hold for experiments on MLP trained on MNIST (figure 3) and Resnet-56 trained on CIFAR-10 (figure 2).\nWe train VGG-11 on CIFAR-10 with a batch size of 100 and fixed learning rate of 0.1. 
For visual clarity, we report the loss interpolation between consecutive iterations for 40 iterations from epochs 1, 2, 25 and 100. The interpolation is shown in figure 1. To be clear, the x-axis is calibrated by the number of iterations, and there are 10 interpolated loss values between each pair of consecutive iterations (vertical gray lines) in the training loss plots. In these plots, we find two interesting behaviors of SGD.\nFirst, we find that the interpolated loss between every pair of consecutive SGD updates appears to have a quadratic-like structure with a minimum in between. Note that while this is not visible for epochs 25 and 100, we later show quantitative measurements that support this claim. This plot thus shows that in the iterations plotted, SGD rarely crosses barriers.\nSecond, we observe how the minimum of each interpolation evolves as training progresses. This is highlighted in figure 1 (a) with a dashed orange line. We find that this minimum has ups and downs along the path of SGD for all our interpolation plots. To draw deductions from this observation, consider a simple example that helps us understand this scenario concretely. Let parameter points \u03b8A, \u03b8B and \u03b8C be the result of three consecutive SGD updates with loss values \u2113A, \u2113B and \u2113C (using the full training set). Note that since these are only three points, they lie in a two dimensional subspace and the loss value can be imagined along the third dimension. Then, corresponding to the behavior in the plot, there is a parameter point \u03b8AB between \u03b8A and \u03b8B on the line connecting these two points, which has a loss value \u2113AB < \u2113A, \u2113B. Similarly, there is a point \u03b8BC between \u03b8B and \u03b8C on the line connecting these two points, which has a loss value \u2113BC < \u2113B, \u2113C. 
Given this construction, for any configuration of \u03b8A, \u03b8B and \u03b8C on the two dimensional plane, it is easy to see that if \u2113AB < \u2113BC, any path from \u03b8AB to \u03b8BC will have loss values that must increase at some point. Hence, what this construction essentially represents (as we refer to it) is a situation where SGD has moved over a barrier. Therefore, the ups and downs of the minimum between loss interpolations in figure 1 (a,b,c,d) represent SGD moving over barriers. In this way we find that when running SGD on the loss surface of deep networks, instead of crossing barriers, a more dominant way SGD performs exploration is by moving over them.\nThe same qualitative analysis for SGD with different hyper-parameters is also shown in section 1 of the appendix. The observations we described here remain consistent for all these experiments.\nSo far we showed qualitative visualizations to make the claim that SGD rarely crosses barriers. In order to show that the claim extends to the rest of training, instead of only the few iterations shown above, we now quantitatively measure how many barriers are crossed over an entire epoch in different phases of training. This result is shown in table 1 for VGG-11 and Resnet-56 trained on CIFAR-10 (trained for 100 epochs) and an MLP trained on MNIST (trained for 40 epochs). We note that in each case, an epoch consists of more than 450 iterations. As we see, a negligible number of barriers are crossed for most parts of the training compared to the number of iterations performed during each epoch. For concreteness, we further compute the number of barriers crossed for the first 40 epochs for VGG-11 on CIFAR-10 as shown in Figure 4 and reach the same conclusion.\n
Note that even for these barriers, their heights (defined by (L(\u03b8_t) + L(\u03b8_{t+1}) \u2212 2L(\u03b8_t^{min}))/2) are substantially smaller than the loss values at the corresponding iterations (not mentioned here), meaning they are not significant barriers. 4 THE EFFECT OF BATCH-SIZE ON EXPLORATION\n 4.1 ANALYSIS\nHoffer et al. (2017) discuss that SGD training with different batch-sizes leads to different diffusion rates (very large batch sizes being slower). Further, when training for the same number of epochs, large batch training performs fewer iterations. Combining these two observations, they reach the conclusion that large batch training makes the diffusion process slow. As empirical evidence, they show that the distance of parameters from initialization evolves logarithmically in the number of iterations.\nWe now present a complementary optimization perspective to their observation. To continue, we introduce the following notation. Let pi(\u03b8) denote the predicted probability output (of the correct class in the classification setting for instance) of a DNN parameterized by \u03b8 for the ith data sample (in total N samples). Then the negative log likelihood loss for the ith sample is given by Li(\u03b8) = \u2212 log(pi(\u03b8)). The gradient gB(\u03b8) from mini-batch SGD at a parameter value \u03b8 is expressed as gB(\u03b8) = (1/B) \u2211_{i\u2208B} \u2202Li(\u03b8)/\u2202\u03b8, g\u0304(\u03b8) denotes the expected gradient using all training samples, B is the mini-batch size (and we have also overloaded it to mean the mini-batch set) and C(\u03b8) is the gradient covariance matrix at \u03b8. 
Then the relation between the Hessian H(\u03b8) and the dataset gradient covariance C(\u03b8) for negative log likelihood loss is described by the Gauss-Newton decomposition as follows,\nH(\u03b8) = C(\u03b8) + g\u0304(\u03b8)g\u0304(\u03b8)^T + (1/N) \u2211_{i=1}^{N} (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202^2 pi(\u03b8)/\u2202\u03b8^2) (2)\nwhere H(\u03b8) is the Hessian of the loss. The derivation can be found in section B of the appendix.\nTo continue with our argument, we note that it has been discussed by Shwartz-Ziv & Tishby (2017) that early on during training, the mean gradient over the training set is larger in magnitude compared to the variance in gradients. The above argument essentially says that the scale of mean gradient g\u0304(\u03b8) is larger compared with the scale of C(\u03b8). Ignoring the second order term in the Gauss-Newton decomposition above, the Hessian is dominated by the rank-one term g\u0304(\u03b8)g\u0304(\u03b8)^T while the scale of gradient covariance is much smaller, so the mean gradient must be aligned with the top eigenvectors of the Hessian early on during training. Finally, we note that using large batch-sizes makes the mini-batch gradient closer to the mean gradient by reducing the scale of mini-batch gradient covariance as shown by Hoffer et al. (2017),\ncov(gB(\u03b8), gB(\u03b8)) = (1/B \u2212 1/N) C(\u03b8) (3)\nThe two arguments together imply that gradients from larger batch-sizes are likely to be more aligned with the high curvature directions of the loss surface especially early on during training.\nIn convex optimization theory, when gradients point along the top eigenvectors of the Hessian (also referred to as the sharp directions of the loss surface), optimization exhibits under-damped convergence, meaning it oscillates along the sharp directions in the case when the learning rate is smaller than a certain threshold. Applying this idea to non-convex loss landscapes, a large alignment between the mini-batch gradient and the sharp directions should also lead to oscillations. 
At this point, we depart from the conclusions of the convex setting and recall our observation in the previous section that the interpolation between consecutive iterations has a quadratic-like shape and SGD moves over barriers for the deep network loss surface. We thus hypothesize that a lower alignment between mini-batch gradients and the sharp directions of the loss surface makes SGD exploration faster by exhibiting less oscillation, and vice-versa. 4.2 EMPIRICAL VERIFICATION\nBased on the theoretical analysis above, we first conduct experiments to empirically verify that the alignment of the mini-batch gradient gB(\u03b8) and the Hessian H(\u03b8) increases when we increase the mini-batch size. To do so, we calculate the alignment of the mini-batch gradient gB(\u03b8) and the Hessian H(\u03b8) as\ngB(\u03b8)^T H(\u03b8) gB(\u03b8) / \u2016gB(\u03b8)\u2016_2^2. (4)\nFigure 5 and figure 6 show the alignments calculated according to Equation 4 on both VGG-11 with CIFAR-10 and Resnet-56 with CIFAR-100 separately at the end of Epoch 5 and Epoch 10. We calculate the alignment for mini-batch sizes 100, 1000, 10000 and 45000 (which is the dataset size). For every mini-batch size, we sample 50 different batches, calculate the alignment of the current mini-batch gradient with the Hessian, and show both the mean and standard deviation of alignments in the plots. From both figure 5 and figure 6, we can see that the alignment between the mini-batch gradient and the Hessian is larger for larger mini-batch sizes.\nHaving verified the alignment between mini-batch gradients and sharp directions above, we now verify whether this alignment leads SGD to oscillate in the proximity of the parameter initialization, thus slowing down exploration. Note the latter has been shown by Hoffer et al. (2017). Therefore, to substantiate our claim, we show that the degree of oscillation in SGD increases with large batch size. 
Specifically, while training deep networks, we keep track of the cosine of the angle between mini-batch gradients from every two consecutive SGD iterations,\ncos(g_{t\u22121}, g_t) := g_{t\u22121}^T g_t / (\u2016g_{t\u22121}\u2016_2 \u2016g_t\u2016_2). (5)\nFigure 7 shows the cosine calculated according to Equation 5 for Resnet-56 on CIFAR-10 and WResnet on CIFAR-100. Experiments are run with the same learning rate for batch sizes 500, 5000 and 45000 (dataset size). We can see from the plot that the cosine of the angle between mini-batch gradients from two consecutive iterations remains smaller for larger batch sizes, which indicates that the gradients from two consecutive iterations point more in opposite directions for larger batch sizes. Together with the parameter distance results from Hoffer et al. (2017), which show that within the same number of iterations, the parameter norm for larger batch sizes is smaller, our experiment verifies that for larger batch sizes, SGD oscillates more in the proximity of the parameter initialization instead of exploring farther away regions. 6 DISCUSSION AND CONCLUSION\nThrough qualitative results that showed how SGD interacts with the DNN loss surface, we showed evidence that SGD rarely crosses barriers during training. We presented an alternate mechanism that SGD uses to explore different regions of the deep network loss landscape.\nWe draw similarities between the optimization trajectory in DNNs that we have empirically found and those in quadratic loss optimization (see section 5 of LeCun et al. (1998)). Based on our empirical evidence, we found that the loss interpolation between parameters from consecutive updates has a quadratic-like shape. 
This is reminiscent of optimization in a quadratic loss setting with a non-isotropic positive semi-definite Hessian, where the optimal learning rate \u03b7 causes underdamping without divergence along eigenvectors of the Hessian which have eigenvalues \u03bb_i such that \u03bb_i^{\u22121} < \u03b7 < 2\u03bb_i^{\u22121}.\nIn the second part of our analysis, we investigated the role of batch-size in exploration, for different regions during SGD optimization of a DNN loss surface. We presented an argument showing mini-batch gradients from larger batch-sizes should align more with the high curvature directions of the loss surface, especially early during training when the scale of mean gradients dominates over gradient covariance. Additionally, we present a complementary view of the exploration aspect of SGD that stems from its diffusion perspective, and show that the alignment of the mini-batch gradient with the sharp directions of the Hessian leads to oscillations preventing SGD from exploring regions far from the initialized parameters.\nFinally, much of what we have discussed is based on the loss landscape of specific datasets and architectures along with network parameterization choices like rectified linear activation units (ReLUs) and batch normalization (Ioffe & Szegedy, 2015). These conclusions may differ depending on these choices. In these cases analysis similar to ours can be performed to see if similar dynamics hold or not. Studying these dynamics may provide more practical guidelines for setting optimization hyperparameters. A OPTIMIZATION TRAJECTORY\nThis is a continuation of section 3.1 in the main text. Here we show further experiments on other datasets, architectures and hyper-parameter settings. The analyses of GD training for Resnet-56 on CIFAR-10, MLP on MNIST and VGG-11 on tiny ImageNet are shown in figures 8, 16 and 19 respectively. 
Similarly, the analyses of SGD training for Resnet-56 on the CIFAR-10 dataset with a batch size of 100 and learning rate 0.1 for epochs 1, 2, 25 and 100 are shown in figures 9, 10, 11 and 12 respectively. The analyses of SGD training for VGG-11 on CIFAR-10 with a batch size of 100 and learning rate 0.1 for epochs 2, 25 and 100 are shown in figures 13, 14 and 15. The analyses of SGD training for MLP on MNIST for epochs 1 and 2 are shown in figures 17 and 18. The analysis of SGD training for VGG-11 on tiny ImageNet for epoch 1 is shown in figure 20. We also conducted the same experiment and analysis on various batch sizes and learning rates for every architecture. Results of VGG-11 can be found in figures 21, 22, 23 and 24. Results of Resnet-56 can be found in figures 25, 26, 27 and 28. The observations we described in section 3 are consistent across all these experiments. Specifically, for the interpolation of SGD for VGG-11 on tiny ImageNet, the valley-like trajectory looks irregular, but even so, according to our quantitative evaluation there is no barrier between any two consecutive iterations.\nB IMPORTANCE OF SGD NOISE STRUCTURE\nHere we derive in detail the relation between the Hessian and gradient covariance for the negative log likelihood loss Li(\u03b8) = \u2212 log(pi(\u03b8)). 
Note we use the fact that for this particular loss function, \u2202Li(\u03b8)/\u2202pi(\u03b8) = \u22121/pi(\u03b8), and \u2202^2 Li(\u03b8)/\u2202pi(\u03b8)^2 = 1/pi(\u03b8)^2, which yields \u2202^2 Li(\u03b8)/\u2202pi(\u03b8)^2 = (\u2202Li(\u03b8)/\u2202pi(\u03b8))^2.\nH(\u03b8) = (1/N) \u2211_{i=1}^{N} \u2202^2 Li(\u03b8)/\u2202\u03b8^2 (6)\n= (1/N) \u2211_{i=1}^{N} \u2202/\u2202\u03b8 ( (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202pi(\u03b8)/\u2202\u03b8) ) (7)\n= (1/N) \u2211_{i=1}^{N} [ (\u2202^2 Li(\u03b8)/\u2202pi(\u03b8)^2) \u00b7 (\u2202pi(\u03b8)/\u2202\u03b8)(\u2202pi(\u03b8)/\u2202\u03b8)^T + (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202^2 pi(\u03b8)/\u2202\u03b8^2) ] (8)\n= (1/N) \u2211_{i=1}^{N} [ (\u2202Li(\u03b8)/\u2202pi(\u03b8))^2 \u00b7 (\u2202pi(\u03b8)/\u2202\u03b8)(\u2202pi(\u03b8)/\u2202\u03b8)^T + (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202^2 pi(\u03b8)/\u2202\u03b8^2) ] (9)\n= (1/N) \u2211_{i=1}^{N} [ (\u2202Li(\u03b8)/\u2202\u03b8)(\u2202Li(\u03b8)/\u2202\u03b8)^T + (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202^2 pi(\u03b8)/\u2202\u03b8^2) ] (10)\n= C(\u03b8) + g\u0304(\u03b8)g\u0304(\u03b8)^T + (1/N) \u2211_{i=1}^{N} (\u2202Li(\u03b8)/\u2202pi(\u03b8)) \u00b7 (\u2202^2 pi(\u03b8)/\u2202\u03b8^2) (11)\nwhere g\u0304(\u03b8) = (1/N) \u2211_{i=1}^{N} \u2202Li(\u03b8)/\u2202\u03b8. C DISCUSSION\nIn the main text, we talk about convergence in the quadratic setting depending on the value of the learning rate relative to the largest eigenvalue of the Hessian. The convergence in this setting has been visualized in ??.", "source": "paper_review", "evaluation": "rouge"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Incremental decoding and training methods for simultaneous translation in neural machine translation , Authors: Fahim Dalvi Nadir Durrani... , 2018\n2.  Title: Don\u2019t until the final verb wait: Reinforcement learning for simultaneous machine translation , Authors: Alvin Grissom II He He... , 2014\n3.  Title: Learning to translate in real-time with neural machine translation , Authors: Jiatao Gu Graham Neubig... , 2017\n4.  Title: Learning hard alignments with variational inference , Authors: Dieterich Lawson Chung-Cheng Chiu... , 2018\n5.  Title: Learning online alignments with continuous rewards policy gradient , Authors: Yuping Luo Chung-Cheng Chiu... , 2017\n6.  Title: STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework , Authors: Mingbo Ma Liang Huang... , 2019\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Simultaneous machine translation models start generating a target sequence before they have encoded the source sequence. Recent approaches for this task either apply a fixed policy on a state-of-the art Transformer model, or a learnable monotonic attention on a weaker recurrent neural network-based structure. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention. We also introduce two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads. We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. 
We analyze how the latency controls affect the attention span and we study the relationship between the speed of a head and the layer it belongs to. Finally, we motivate the introduction of our model by analyzing the effect of the number of decoder layers and heads on quality and latency.", "Recent work on simultaneous machine translation falls into three categories. In the first one, models use a rule-based policy for reading input and writing output. Cho & Esipova (2016) propose a WaitIf-* policy to enable an offline model to decode simultaneously. Ma et al. (2019) propose a wait-k policy where the model first reads k tokens, then alternates between read and write actions. Dalvi et al. (2018) propose an incremental decoding method, also based on a rule-based schedule. In the second category, a flexible policy is learnt from data. Grissom II et al. (2014) introduce a Markov chain to phrase-based machine translation models for simultaneous machine translation, in which they apply reinforcement learning to learn the read-write policy based on states. Gu et al. (2017) introduce an agent which learns to make decisions on when to translate from the interaction with a pre-trained offline neural machine translation model. Luo et al. (2017) used continuous rewards policy gradient for online alignments for speech recognition. Lawson et al.", "This paper proposes a fully transformer-based monotonic attention framework that extends the idea of MILK. Though the idea of monotonic multi-head attention sounds interesting, I still have some questions below:\n\nAbout the method:\n   1. Is that possible that the MMA would have worse latency than MILK since all the attention heads need to agree to write while MILK only has one attention head?\n   2. Is there any attention order between different attention head?\n   3. I think the MMA only could control the latency during training time, which would produce different models with different latency. 
Is there any way that enables MMA to control the latency during inference time? Can we change the latency for on given model by tuning the requirements mentioned in Eq. (1)?\n\nAbout the experiments:\n    1. Do you have any explanation of why both MMA-H and MMA-IL have better BLEU when AL is small? The results in fig 2 seem counterintuitive. \n    2. I suggest the authors do more analysis of the difference between different attention heads to prove the effectiveness of MMA. \n    3. For the left two figures in fig 4, which one is the baseline, and which one is the proposed model?\n\nI also suggest the authors present more real sample analysis and discussions about the experiments.\n6: Weak Accept"], "input": "1 INTRODUCTION\nSimultaneous machine translation adds the capability of a live interpreter to machine translation: a simultaneous model starts generating a translation before it has finished reading the entire source sentence. Such models are useful in any situation where translation needs to be done in real time. For example, simultaneous models can translate live video captions or facilitate conversations between people speaking different languages. In a usual translation model, the encoder first reads the entire sentence, then the decoder writes the target sentence. On the other hand, a simultaneous neural machine translation model alternates between reading the input and writing the output using either a fixed or learned policy.\nMonotonic attention mechanisms fall into the flexible policy category, in which the policies are automatically learned from data. Recent work exploring monotonic attention variants for simultaneous translation include: hard monotonic attention (Raffel et al., 2017), monotonic chunkwise attention (MoChA) (Chiu & Raffel, 2018) and monotonic infinite lookback attention (MILk) (Arivazhagan et al., 2019). 
MILk in particular has shown better quality/latency trade-offs than fixed policy approaches, such as wait-k (Ma et al., 2019) or wait-if-* (Cho & Esipova, 2016) policies. MILk also outperforms hard monotonic attention and MoChA; while the other two monotonic attention mechanisms only consider a fixed window, MILk computes a softmax attention over all previous encoder states, which may be the key to its improved latency-quality tradeoffs. These monotonic attention approaches also provide a closed-form expression for the expected alignment between source and target tokens.\nHowever, monotonic attention-based models, including the state-of-the-art MILk, were built on top of RNN-based models. RNN-based models have been outperformed by the recent state-of-the-art Transformer model (Vaswani et al., 2017), which features multiple encoder-decoder attention layers and multihead attention at each layer.\n\u2217Work conducted during an internship at Facebook 1The code is available at https://github.com/pytorch/fairseq/tree/master/\nexamples/simultaneous_translation\nWe thus propose monotonic multihead attention (MMA), which combines the high translation quality from multilayer multihead attention and low latency from monotonic attention. We propose two variants, Hard MMA (MMA-H) and Infinite Lookback MMA (MMA-IL). MMA-H is designed with streaming systems in mind where the attention span must be limited. MMA-IL emphasizes the quality of the translation system. We also propose two novel latency regularization methods. The first encourages the model to be faster by directly minimizing the average latency. The second encourages the attention heads to maintain similar positions, preventing the latency from being dominated by a single or a few heads.\nThe main contributions of this paper are: (1) A novel monotonic attention mechanism, monotonic multihead attention, which enables the Transformer model to perform online decoding. 
This model leverages the power of the Transformer and the efficiency of monotonic attention. (2) Better latency/quality tradeoffs compared to the MILk model, the previous state-of-the-art, on two standard translation benchmarks, IWSLT15 English-Vietnamese (En-Vi) and WMT15 German-English (De-En). (3) Analyses on how our model is able to control the attention span and on the relationship between the speed of a head and the layer it belongs to. We motivate the design of our model with an ablation study on the number of decoder layers and the number of decoder heads. 2 MONOTONIC MULTIHEAD ATTENTION MODEL\nIn this section, we review the monotonic attention-based approaches in RNN-based encoder-decoder models. We then introduce the two types of Monotonic Multihead Attention (MMA) for Transformer models: MMA-H and MMA-IL. Finally, we introduce strategies to control latency and coverage. 2.1 MONOTONIC ATTENTION\nThe hard monotonic attention mechanism (Raffel et al., 2017) was first introduced in order to achieve online linear time decoding for RNN-based encoder-decoder models. We denote the input sequence as x = {x1, ..., xT}, and the corresponding encoder states as m = {m1, ..., mT}, with T being the length of the source sequence. The model generates a target sequence y = {y1, ..., yU} with U being the length of the target sequence. At the i-th decoding step, the decoder only attends to one encoder state mti with ti = j. When generating a new target token yi, the decoder chooses whether to move one step forward or to stay at the current position based on a Bernoulli selection probability pi,j, so that ti \u2265 ti\u22121. 
Denoting the decoder state at the i-th position, starting from j = ti\u22121, ti\u22121 + 1, ti\u22121 + 2, ..., this process can be calculated as follows: 2\nei,j = MonotonicEnergy(si\u22121,mj) (1) pi,j = Sigmoid (ei,j) (2) zi,j \u223c Bernoulli(pi,j) (3)\nWhen zi,j = 1, we set ti = j and start generating a target token yi; otherwise, we set ti = j+ 1 and repeat the process. During training, an expected alignment \u03b1 is introduced to replace the softmax attention. It can be calculated in a recurrent manner, shown in Equation 4:\n\u03b1i,j = pi,j j\u2211 k=1\n( \u03b1i\u22121,k\nj\u22121\u220f l=k (1\u2212 pi,l)\n)\n= pi,j ( (1\u2212 pi,j\u22121)\n\u03b1i,j\u22121 pi,j\u22121 + \u03b1i\u22121,j ) (4) Raffel et al. (2017) also introduce a closed-form parallel solution for the recurrence relation in Equation 5:\n\u03b1i,: = pi,:cumprod(1\u2212 pi,:)cumsum (\n\u03b1i\u22121,: cumprod(1\u2212 pi,:)\n) (5)\nwhere cumprod(x) = [1, x1, x1x2, ..., \u220f|x|\u22121 i=1 xi] and cumsum(x) = [x1, x1 + x2, ..., \u2211|x| i=1 xi]. In practice, the denominator in Equation 5 is clamped into a range of [ , 1] to avoid numerical instabilities introduced by cumprod. Although this monotonic attention mechanism achieves online\n2Note that during training, to encourage discreteness, Raffel et al. (2017) added a zero mean, unit variance pre-sigmoid noise to ei,j .\nlinear time decoding, the decoder can only attend to one encoder state. This limitation can diminish translation quality as there may be insufficient information for reordering.\nMoreover, the model lacks a mechanism to adjust latency based on different requirements at decoding time. To address these issues, Chiu & Raffel (2018) introduce Monotonic Chunkwise Attention (MoChA), which allows the decoder to apply softmax attention to a fixed-length subsequence of encoder states. Alternatively, Arivazhagan et al. 
(2019) introduce Monotonic Infinite Lookback Attention (MILk) which allows the decoder to access encoder states from the beginning of the source sequence. The expected attention for the MILk model is defined in Equation 6.\n\u03b2i,j = |x|\u2211 k=j ( \u03b1i,k exp(ui,j)\u2211k l=1 exp(ui,l) ) (6) 2.2 MONOTONIC MULTIHEAD ATTENTION\nPrevious monotonic attention approaches are based on RNN encoder-decoder models with a single attention and haven\u2019t explored the power of the Transformer model. 3 The Transformer architecture (Vaswani et al., 2017) has recently become the state-of-the-art for machine translation (Barrault et al., 2019). An important feature of the Transformer is the use of a separate multihead attention module at each layer. Thus, we propose a new approach, Monotonic Multihead Attention (MMA), which combines the expressive power of multihead attention and the low latency of monotonic attention.\nMultihead attention allows each decoder layer to have multiple heads, where each head can compute a different attention distribution. Given queries Q, keys K and values V , multihead attention MultiHead(Q,K, V ) is defined in Equation 7.\nMultiHead(Q,K, V ) = Concat(head1, ..., headH)WO where headh = Attention ( QWQh ,KW K h , V W V h , ) (7)\nThe attention function is the scaled dot-product attention, defined in Equation 8: Attention(Q,K, V ) = Softmax ( QKT\u221a dk ) V (8)\nThere are three applications of multihead attention in the Transformer model:\n1. The Encoder contains self-attention layers where all of the queries, keys and values come from previous layers.\n2. The Decoder contains self-attention layers that allow each position in the decoder to attend to all positions in the decoder up to and including that position.\n3. The Encoder-Decoder attention contains multihead attention layers where queries come from the previous decoder layer and the keys and values come from the output of the encoder. 
Every decoder layer has a separate encoder-decoder attention.\nFor MMA, we assign each head to operate as a separate monotonic attention in encoder-decoder attention.\nFor a transformer with L decoder layers and H attention heads per layer, we define the selection process of the h-th head encoder-decoder attention in the l-th decoder layer as\nel,hi,j =\n( mjW K l,h(si\u22121W Q l,h) T\n\u221a dk\n) i,j\n(9)\npl,hi,j = Sigmoid(ei,j) (10)\nzl,hi,j \u223c Bernoulli(pi,j) (11) 3MILk was based on a strengthened RNN-based model called RNMT+. The original RNMT+ model (Chen et al., 2018) uses multihead attention, computes attention only once, and then concatenates that single attention layer to the output of each decoder layer block. However, the RNMT+ model used for MILk in Arivazhagan et al. (2019) only uses a single head.\nwhere Wl,h is the input projection matrix, dk is the dimension of the attention head. We make the selection process independent for each head in each layer. We then investigate two types of MMA, MMA-H(ard) and MMA-IL(infinite lookback). For MMA-H, we use Equation 4 in order to calculate the expected alignment for each layer each head, given pl,hi,j . For MMA-IL, we calculate the softmax energy for each head as follows:\nul,hi,j = SoftEnergy =\n( mjW\u0302 K l,h(si\u22121W\u0302 Q l,h) T\n\u221a dk\n) i,j\n(12)\nand then use Equation 6 to calculate the expected attention. Each attention head in MMA-H hardattends to one encoder state. On the other hand, each attention head in MMA-IL can attend to all previous encoder states. Thus, MMA-IL allows the model to leverage more information for translation, but MMA-H may be better suited for streaming systems with stricter efficiency requirements. Finally, our models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is also required for simultaneous translation.\nAt inference time, our decoding strategy is shown in Algorithm 1. 
For each l, h, at decoding step i, we apply the sampling processes discussed in subsection 2.1 individually and set the encoder step at tl,hi . Then a hard alignment or partial softmax attention from encoder states, shown in Equation 13, will be retrieved to feed into the decoder to generate the i-th token. The model will write a new target token only after all the attentions have decided to write. In other words, the heads that have decided to write must wait until the others have finished reading.\ncli = Concat(c l,1 i , c l,2 i , ..., c l,H i )\nwhere cl,hi = fcontext(h, t l,h i ) =  mtl,hi\nMMA-H\ntl,hi\u2211 j=1\nexp ( ul,hi,j ) \u2211tl,hi j=1 exp ( ul,hi,j )mj MMA-IL (13)\nFigure 1 illustrates a comparison between our model and the monotonic model with one attention head. Compared with the monotonic model, the MMA model is able to set attention to different positions so that it can still attend to previous states while reading each new token. Each head can adjust its speed on-the-fly. Some heads read new inputs, while the others can stay in the past to retain the source history information. Even with the hard alignment variant (MMA-H), the model is still able to preserve the history information by setting heads to past states. In contrast, the hard monotonic model, which only has one head, loses the previous information at the attention layer. 2.3 LATENCY CONTROL\nEffective simultaneous machine translation must balance quality and latency. At a high level, latency measures how many source tokens the model has read until a translation is generated. The model we have introduced in subsection 2.2 is not able to control latency on its own. While MMA allows simultaneous translation by having a read or write schedule for each head, the overall latency is determined by the fastest head, i.e. the head that reads the most. It is possible that a head always reads new input without producing output, which would result in the maximum possible latency. 
Note that the attention behaviors in MMA-H and MMA-IL can be different. In MMA-IL, a head reaching the end of the sentence will provide the model with maximum information about the source sentence. On the other hand, in the case of MMA-H, reaching the end of sentence for a head only\nAlgorithm 1 MMA monotonic decoding. Because each head is independent, we compute line 3 to 16 in parallel\nInput: x = source tokens,h = encoder states, i = 1, j = 1, tl,h0 = 1, y0 = StartOfSequence. 1: while yi\u22121 6= EndOfSequence do 2: tmax = 1 3: h = empty sequence 4: for l\u2190 1 to L do 5: for h\u2190 1 to H do 6: for j \u2190 tl,hi\u22121 to |x| do 7: pl,hi,j = Sigmoid ( MonotonicEnergy(si\u22121,mj )\n) 8: if pl,hi,j > 0.5 then 9: tl,hi = j\n10: cl,hi = fcontext(h, t l,h i ) 11: Break 12: else 13: if j > tmax then 14: Read token xj 15: Calculate state hj and append to h 16: tmax = j 17: cli = Concat(c l,1 i , c l,2 i , ..., c l,H i ) 18: sli = DecoderLayer l(sl1:i\u22121, s l\u22121 1:i\u22121, c l i) 19: yi = Output(sLi ) 20: i = i+ 1\ngives a hard alignment to the end-of-sentence token, which provides very little information to the decoder. Furthermore, it is possible that an MMA-H attention head stays at the beginning of sentence without moving forward. Such a head would not cause latency issues but would degrade the model quality since the decoder would not have any information about the input. In addition, this behavior is not suited for streaming systems.\nTo address these issues, we introduce two latency control methods. The first one is weighted average latency, shown in Equation 14:\ngWi = exp(gl,hi )\u2211L\nl=1 \u2211H h=1 exp(g l,h i ) gl,hi (14)\nwhere gl,hi = \u2211|x| j=1 j\u03b1i,j . Then we calculate the latency loss with a differentiable latency metric C.\nLavg = C ( gW )\n(15)\nLike Arivazhagan et al. (2019), we use the Differentiable Average Lagging. It is important to note that, unlike the original latency augmented training in Arivazhagan et al. 
(2019), Equation 15 is not the expected latency metric given C, but the weighted average of C over all the attentions. The real expected latency is g\u0302 = max_{l,h} (g^{l,h}) instead of g\u0304, but using this directly would only affect the speed of the fastest head. Equation 15 can control every head in a way that the faster heads will be automatically assigned larger weights and slower heads will also be moderately regularized. For MMA-H models, we found that the latency is mainly due to outliers that skip almost every token. The weighted average latency loss is not sufficient to control the outliers. We therefore introduce the head divergence loss, the average variance of expected delays at each step, defined in Equation 16:\nLvar = (1/(LH)) \u2211_{l=1}^{L} \u2211_{h=1}^{H} (g_i^{l,h} \u2212 g\u0304_i)^2 (16)\nwhere g\u0304_i = (1/(LH)) \u2211_{l,h} g_i^{l,h}. The final objective function is presented in Equation 17: L(\u03b8) = \u2212 log(y | x; \u03b8) + \u03bbavg Lavg + \u03bbvar Lvar (17) where \u03bbavg, \u03bbvar are hyperparameters that control both losses. Intuitively, while \u03bbavg controls the overall speed, \u03bbvar controls the divergence of the heads. Combining these two losses, we are able to dynamically control the range of attention heads so that we can control the latency and the reading buffer. For the MMA-IL model, we only use Lavg; for MMA-H we only use Lvar. 3 EXPERIMENTAL SETUP\n 3.1 EVALUATION METRICS\nWe evaluate our model using quality and latency. For translation quality, we use tokenized BLEU 4 for IWSLT15 En-Vi and detokenized BLEU with SacreBLEU (Post, 2018) for WMT15 De-En. For latency, we use three different recent metrics, Average Proportion (AP) (Cho & Esipova, 2016), Average Lagging (AL) (Ma et al., 2019) and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019) 5. We remind the reader of the metric definitions in Appendix A.2. 3.2 DATASETS\nWe evaluate our method on two standard machine translation datasets, IWSLT15 En-Vi and WMT15 De-En. 
Statistics of the datasets can be found in Table 1. For each dataset, we apply tokenization with the Moses (Koehn et al., 2007) tokenizer and preserve casing.\nIWSLT15 English-Vietnamese TED talks from IWSLT 2015 Evaluation Campaign (Cettolo et al., 2016). We follow the settings from Luong & Manning (2015) and Raffel et al. (2017). We replace words with frequency less than 5 by <unk>. We use tst2012 as a validation set and tst2013 as a test set.\nWMT15 German-English We follow the setting from Arivazhagan et al. (2019). We apply byte pair encoding (BPE) (Sennrich et al., 2016) jointly on the source and target to construct a shared vocabulary with 32K symbols. We use newstest2013 as a validation set and newstest2015 as a test set. 3.3 MODELS\nWe evaluate MMA-H and MMA-IL models on both datasets. The MILk model we evaluate on IWSLT15 En-Vi is based on Luong et al. (2015) rather than RNMT+ (Chen et al., 2018). In general, our offline models use unidirectional encoders, i.e. the encoder self-attention can only attend to previous states, and greedy decoding. We report offline model performance in Table 2 and the effect of using unidirectional encoders and greedy decoding in Table 3. For MMA models, we replace the encoder-decoder layers with MMA and keep other hyperparameter settings the same as the offline model. Detailed hyperparameter settings can be found in subsection A.1. We use the Fairseq library (Ott et al., 2019) for our implementation.\n4 We acquire the data from https://nlp.stanford.edu/projects/nmt/, which is tokenized. We do not have the tokenizer which processed this data, thus we report tokenized BLEU for IWSLT15.\n5 Latency metrics are computed on BPE tokens for WMT15 De-En \u2013 consistent with Arivazhagan et al. (2019) \u2013 and on word tokens for IWSLT15 En-Vi.\n6 Luong & Manning (2015) report a BLEU score of 23.0 but they didn\u2019t mention what type of BLEU score they used. 
This score is from our implementation on the data acquired from https://nlp.stanford.edu/projects/nmt/ 4 RESULTS\nIn this section, we present the main results of our model in terms of latency-quality tradeoffs, ablation studies and analyses. In the first study, we analyze the effect of the variance loss on the attention span. Then, we study the effect of the number of decoder layers and decoder heads on quality and latency. We also provide a case study for the behavior of attention heads in an example. Finally, we study the relationship between the rank of an attention head and the layer it belongs to. 4.1 LATENCY-QUALITY TRADEOFFS\nWe plot the quality-latency curves for MMA-H and MMA-IL in Figure 2. The BLEU and latency scores on the test sets are generated by setting a latency range and selecting the checkpoint with best BLEU score on the validation set. We use differentiable average lagging (Arivazhagan et al., 2019) when setting the latency range. We find that for a given latency, our models obtain a better translation quality. While MMA-IL tends to have a decrease in quality as the latency decreases, MMA-H has a small gain in quality as latency decreases: a larger latency does not necessarily mean an increase in source information available to the model. In fact, the large latency is from the outlier attention heads, which skip the entire source sentence and point to the end of the sentence. The outliers not only increase the latency but they also do not provide useful information. We introduce the attention variance loss to eliminate the outliers, as such a loss makes the attention heads focus on the current context for translating the new target token.\nIt is interesting to observe that MMA-H has a better latency-quality tradeoff than MILk7 even though each head attends to only one state. 
Although MMA-H is not yet able to handle an arbitrarily long input (without resorting to segmenting the input), since both encoder and decoder self-attention have an infinite lookback, that model represents a good step in that direction. 4.2 ATTENTION SPAN\nIn subsection 2.3, we introduced the attention variance loss to MMA-H in order to prevent outlier attention heads from increasing the latency or increasing the attention span. We have already evaluated the effectiveness of this method on latency in subsection 4.1. We also want to measure the difference between the fastest and slowest heads at each decoding step. We define the average attention span in Equation 18:\n$$\\bar{S} = \\frac{1}{|y|} \\sum_{i}^{|y|} \\left( \\max_{l,h} t_i^{l,h} - \\min_{l,h} t_i^{l,h} \\right) \\quad (18)$$\nIt estimates the reading buffer we need for streaming translation. We show the relation between the average attention span and \u03bbvar in Figure 3. As expected, the average attention span is reduced as we increase \u03bbvar.\n7 The numbers of MILk on WMT15 De-En are from Arivazhagan et al. (2019).\n4.3 EFFECT ON NUMBER OF LAYERS AND NUMBER OF HEADS\nOne motivation to introduce MMA is to adapt the Transformer, which is the current state-of-the-art model for machine translation, to online decoding. Important features of the Transformer architecture include having a separate attention layer for each decoder layer block and multihead attention. In this section, we test the effect of these two components on the offline, MMA-H, and MMA-IL models from a quality and latency perspective. We report quality as measured by detokenized BLEU and latency as measured by DAL on the WMT13 validation set in Figure 4. We set \u03bbavg = 0.2 for MMA-IL and \u03bbvar = 0.2 for MMA-H.\nThe offline model benefits from having more than one decoder layer. In the case of 1 decoder layer, increasing the number of attention heads is beneficial but in the case of 3 and 6 decoder layers, we do not see much benefit from using more than 2 heads. 
The best performance is obtained for 3 layers and 2 heads (6 effective heads). The MMA-IL model behaves similarly to the offline model, and the best performance is observed with 6 layers and 4 heads (24 effective heads). For MMA-H, with 1 layer, performance improves with more heads. With 3 layers, the single-head setting is the most effective (3 effective heads). Finally, with 6 layers, the best performance is reached with 16 heads (96 effective heads).\nThe general trend we observe is that performance improves as we increase the number of effective heads, either from multiple layers or multihead attention, up to a certain point, then either plateaus or degrades. This motivates the introduction of the MMA model.\nWe also note that latency increases with the number of effective attention heads. This is due to having fixed loss weights: when more heads are involved, we should increase \u03bbvar or \u03bbavg to better control latency. 4.4 ATTENTION BEHAVIORS\nWe characterize attention behaviors by providing a running example of MMA-H and MMA-IL, shown in Figure 5. Each curve represents the path that an attention head goes through at inference time. For MMA-H, shown in Figure 5a, we found that when the source and target tokens have the same order, the attention heads behave linearly and the distance between fastest head and slowest head is small. For example, this can be observed from partial sentence pair \u201cI also didn\u2019t know that\u201d and target tokens \u201cT\u00f4i c\u0169ng kh\u00f4ng bi\u1ebft r\u1eb1ng\u201d, which have the same order. However, when the source tokens and target tokens have different orders, such as \u201cthe second step\u201d and \u201cb\u01b0\u1edbc (step) th\u1ee9 hai (second)\u201d, the model will generate \u201cb\u01b0\u1edbc (step)\u201d first and some heads will stay in the past to retain\nthe information for later reordered translation \u201cth\u1ee9 hai (second)\u201d. 
We can also see that the attention heads have a near-diagonal trajectory, which is appropriate for streaming inputs.\nThe behavior of the heads in MMA-IL models is shown in Figure 5b. Notice that we remove the partial softmax alignment in this figure. We don\u2019t expect streaming capability for MMA-IL: some heads stop at an early position of the source sentence to retain the history information. Moreover, because MMA-IL has more information when generating a new target token, it tends to produce translations with better quality. In this example, the MMA-IL model has a better translation on \u201cisolate the victim\u201d than MMA-H (\u201cl\u00e0 c\u00f4 l\u1eadp n\u1ea1n nh\u00e2n\u201d vs \u201cl\u00e0 t\u00e1ch bi\u1ec7t n\u1ea1n nh\u00e2n\u201d). 4.5 RANK OF THE HEADS\nIn Figure 6, we calculate the average and standard deviation of the rank of each head when generating every target token. For MMA-IL, we find that heads in lower layers tend to have higher rank and are thus slower. However, in MMA-H, the differences in average rank are smaller. Furthermore, the standard deviation is very large, which means that the order of the heads in MMA-H changes frequently over the inference process. 6 CONCLUSION\nIn this paper, we propose two variants of the monotonic multihead attention model for simultaneous machine translation. By introducing two new targeted loss terms which allow us to control both latency and attention span, we are able to leverage the power of the Transformer architecture to achieve better quality-latency trade-offs than the previous state-of-the-art model. We also present detailed ablation studies demonstrating the efficacy and rationale of our approach. By introducing these stronger simultaneous sequence-to-sequence models, we hope to facilitate important applications, such as high-quality real-time interpretation between human speakers. 
A APPENDIX\nA.1 HYPERPARAMETERS\nThe hyperparameters we used for offline and monotonic transformer models are defined in Table 4.\nA.2 LATENCY METRICS DEFINITIONS\nGiven the delays g = {g1, g2, ..., g|y|} of generating each target token, AP, AL and DAL are defined in Table 5.\nA.3 DETAILED RESULTS\nWe provide the detailed results of Figure 2 in Table 6 and Table 7.\nA.4 THRESHOLD OF READING ACTION\nWe explore a simple method that can adjust the system\u2019s latency at inference time without training new models. In Algorithm 1 line 8, 0.5 was used as a threshold. One can set a different threshold p at inference time to control the latency. We run pilot experiments on the IWSLT15 En-Vi dataset and the results are shown in Table 8. Although this method doesn\u2019t require training a new model, it dramatically hurts the translation quality.\nA.5 AVERAGE LOSS FOR MMA-H\nWe explore applying a simple average instead of a weighted average loss to MMA-H. The results are shown in Figure 7 and Table 9. We find that even with very large weights, we are unable to reduce the overall latency. In addition, we find that the weighted average loss severely hurts the translation quality. On the other hand, the divergence loss we propose in Equation 16 can efficiently reduce the latency while retaining relatively good translation quality for MMA-H models.\nWMT15 German-English", "source": "paper_review", "evaluation": "LLM"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Multi-scale dense networks for resource efficient image classification , Authors: Gao Huang Danlu Chen... , 2017\n2.  Title: Densely connected convolutional networks , Authors: Gao Huang Zhuang Liu... , 2017\n3.  Title: Condensenet: An efficient densenet using learned group , Authors: Gao Huang Shichen Liu... , 2018\n4.  Title: Blockdrop: Dynamic inference paths in residual networks. 2018 , Authors: Zuxuan Wu Tushar Nagarajan... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Deep neural networks have achieved outstanding performance in many real-world applications with the expense of huge computational resources. The DenseNet, one of the recently proposed neural network architecture, has achieved the state-of-the-art performance in many visual tasks. However, it has great redundancy due to the dense connections of the internal structure, which leads to high computational costs in training such dense networks. To address this issue, we design a reinforcement learning framework to search for efficient DenseNet architectures with layer-wise pruning (LWP) for different tasks, while retaining the original advantages of DenseNet, such as feature reuse, short paths, etc. In this framework, an agent evaluates the importance of each connection between any two block layers, and prunes the redundant connections. In addition, a novel reward-shaping trick is introduced to make DenseNet reach a better trade-off between accuracy and float point operations (FLOPs). Our experiments show that DenseNet with LWP is more compact and efficient than existing alternatives.", "Huang et al. 
(2018) proposed group convolution to remove connections between layers in DenseNet for which this feature reuse is superfluous; Huang et al. (2017a) also suggested progressively updating the prediction for every test sample to unevenly adapt the amount of computational resource at inference time. The most related work is BlockDrop (Wu et al., 2018), which used reinforcement learning to prune weights dynamically at inference time but can only be applied to ResNet or its variants. In contrast, our approach is based on DenseNet, aiming to find an efficient network structure based on the densely connected features of DenseNet.", "The paper introduces an RL-based approach to prune layers in a DenseNet. This work extends BlockDrop to the DenseNet architecture, making the controller independent from the input image. The approach is evaluated on CIFAR10 and CIFAR100 datasets as well as on ImageNet showing promising results.\n\nIn order to improve the paper, the authors could take into consideration the following points:\n1. Given the similarity of the approach with BlockDrop, I would suggest to discuss it in the introduction section clearly stating the similarities and the differences with the proposed approach. \n2. BlockDrop seems to introduce a general framework of policy network to prune neural networks. However, the authors claim that BlockDrop \"can only be applied to ResNets or its variants\". Could the authors comment on this? \n3. In the abstract, the authors claim: \"Our experiments show that DenseNet with LWP is more compact and efficient than existing alternatives\". It is hard to assess if the statement is correct given the evidence presented in the experimental section. It is not clear if the method is more efficient and compact than others, e.g., CondenseNet. \n4. In the experimental section, addressing the following questions would make the section stronger: What is more important, FLOPs or number of parameters? 
What is the accuracy drop we should allow to pay for reduction in number of parameters or FLOPs?\n5. For the evaluation, I would suggest to show that the learned policy is better than a random one: e.g., not using the controller to define policy (in line 20 of the algorithm) and using a random policy instead.\n6. In Table 1, some entries for DenseNet LWP are missing. Is the network converging for these setups? \n7. \\sigma is not explained in section 3.3. What is the intuition behind this hyperparameter?\n\n\nAdditional comments:\n1. It might be interesting to compare the method introduced in the paper to a scenario where the controller is conditioned on an input image and adaptively selects the connections/layers in DenseNet at inference time.\n2. It might be interesting to report the number of connections in Table 1 for all the models.\n\n4: Ok but not good enough - rejection"], "input": "1 INTRODUCTION\nDeep neural networks are increasingly used on mobile devices, where computational resources are quite limited (Chollet, 2017; Sandler et al., 2018; Zhang et al., 2017; Ma et al., 2018). Despite the success of deep neural networks, it is very difficult to make efficient or even real-time inference on low-end devices, due to the intensive computational costs of deep neural networks. Thus, the deep learning community has paid much attention to compressing and accelerating different types of deep neural networks (Gray et al., 2017).\nAmong recently proposed neural network architectures, DenseNet (Huang et al., 2017b) is one of the most dazzling structures which introduces direct connections between any two layers with the same feature-map size. It can scale naturally to hundreds of layers, while exhibiting no optimization difficulties. In addition, it achieved state-of-the-art results across several highly competitive datasets. 
However, recent extensions of Densenet with careful expert design, such as Multi-scale DenseNet(Huang et al., 2017a) and CondenseNet(Huang et al., 2018), have shown that there exists high redundancy in DenseNet. Our paper mainly focuses on how to compress and accelerate the DenseNet with less expert knowledge on network design.\nA number of approaches have been proposed to compress deep networks. Generally, most approaches can be classified into four categories: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation(Gray et al., 2017). Unlike these approaches requiring intensive expert experience, automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Liu et al., 2018; Zoph & Le, 2016; Pham et al., 2018; Liu et al., 2017a; Brock et al., 2017).\nHowever, none of these NAS methods are efficient for DenseNet due to the dense connectivity between layers. It is thus interesting and important to develop an adaptive strategy for searching an on-demand neural network structure for DenseNet such that it can satisfy both computational budget and inference accuracy requirement.\nTo this end, we propose a layer-wise pruning method for DenseNet based on reinforcement learning. Our scheme is that an agent learns to prune as many as possible weights and connections while maintaining good accuracy on validation dataset. As illustrated in Figure 1, our agent learns to output a sequence of actions and receives reward according to the generated network structure on validation datasets. 
Additionally, our agent automatically generates a curriculum of exploration, enabling effective pruning of neural networks.\nExtensive experiments on several highly competitive datasets show that our method largely reduces the number of parameters as well as FLOPs, while maintaining or slightly degrading the prediction performance, such that the corresponding network architecture can adaptively achieve a balance between inference accuracy and computational resources. 2 BACKGROUND\nWe first introduce reinforcement learning and policy gradient in Section 2.1, and DenseNet in Section 2.2, and finally neural architecture search in Section 2.3. 2.1 REINFORCEMENT LEARNING AND POLICY GRADIENT\nReinforcement learning considers the problem of finding an optimal policy for an agent that interacts with an uncertain environment and collects reward per action (Sutton et al., 1998). The goal of the agent is to maximize the long-term cumulative reward. Formally, this problem can be formulated as a Markov decision process over the environment states $s \\in S$ and agent actions $a \\in A$, under an unknown environmental dynamic defined by a transition probability $T(s'|s, a)$ and a reward signal $r(s, a)$ immediately following the action a performed at state s. The agent\u2019s action a is selected by a conditional probability distribution $\\pi(a|s)$ called policy or actor. In policy gradient methods, we consider a set of candidate policies $\\pi_\\theta(a|s)$ parameterized by $\\theta$ and obtain the optimal policy by maximizing the expected cumulative reward or return\n$$J(\\theta) = \\mathbb{E}_{s \\sim \\rho^\\pi, a \\sim \\pi(a|s)} [r(s, a)], \\quad (1)$$\nwhere $\\rho^\\pi(s) = \\sum_{t=1}^{\\infty} \\gamma^{t-1} \\Pr(s_t = s)$ is the normalized discounted state visit distribution with a discount factor $\\gamma \\in [0, 1)$. 
To simplify the notation, we denote $\\mathbb{E}_{s \\sim \\rho^\\pi, a \\sim \\pi(a|s)}[\\cdot]$ by simply $\\mathbb{E}_\\pi[\\cdot]$ in the rest of the paper.\nAccording to the policy gradient theorem (Sutton et al., 1998), the gradient of $J(\\theta)$ can be written as\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\pi [\\nabla_\\theta \\log \\pi(a|s) Q^\\pi(s, a)], \\quad (2)$$\nwhere $Q^\\pi(s, a) = \\mathbb{E}_\\pi [\\sum_{t=1}^{\\infty} \\gamma^{t-1} r(s_t, a_t) \\mid s_1 = s, a_1 = a]$ denotes the expected return under policy $\\pi$ after taking an action a in state s, which is also called critic.\nSince the expectation in Eq (2) is over actions, it is helpful to estimate a value function $V(s)$ and subtract it from $Q(s, a)$ to reduce variance while keeping the estimate unbiased.\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\pi [\\nabla_\\theta \\log \\pi(a|s) (Q^\\pi(s, a) - V(s))]. \\quad (3)$$\nThe most straightforward way is to simulate the environment with the current policy $\\pi$ to obtain a trajectory $\\{(s_t, a_t, r_t)\\}_{t=1}^{n}$ and estimate $\\nabla_\\theta J(\\theta)$ using the Monte Carlo estimation:\n$$\\hat{\\nabla}_\\theta J(\\theta) = \\frac{1}{n} \\sum_{t=1}^{n} \\gamma^{t-1} \\nabla_\\theta \\log \\pi(a_t|s_t) (\\hat{Q}^\\pi(s_t, a_t) - \\hat{V}(s_t)), \\quad (4)$$\nwhere $\\hat{Q}^\\pi(s_t, a_t)$ is an empirical estimate of $Q^\\pi(s_t, a_t)$, e.g., $\\hat{Q}^\\pi(s_t, a_t) = \\sum_{j \\geq t} \\gamma^{j-t} r_j$, and $\\hat{V}(s_t)$ is an empirical estimate of $V(s)$. 2.2 DENSENET\nDensely connected networks (Huang et al., 2017b) consist of multiple dense blocks, each of which also consists of multiple layers. Each layer produces k feature maps, where k is referred to as the growth rate of the network. 
The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block.\nInside every dense block, the first transformation is a composition of batch normalization(BN) and rectified linear units(RELU), followed by the first convolutional layer in the sequence which reduces the number of channels to save computational cost by using the 1 \u00d7 1 filters. The output is then followed by another BN-ReLU combination transformation and is then reduced to the final k output features through a 3 \u00d7 3 convolution. 2.3 NEURAL ARCHITECTURE SEARCH\nNeural Architecture Search(NAS) is a method for automated design of neural network structures, with the aid of either evolutionary algorithms(Xie & Yuille, 2017; Real et al., 2017) or reinforcement learning (Baker et al., 2016; Cai et al., 2018; Zhong et al., 2017; Zoph & Le, 2016; Zoph et al., 2017). When using reinforcement learning, the agent performs a sequence of actions, which specifies a network structure; this network is then trained and its corresponding validation performance is returned as the reward to update the agent. 3 METHOD\nWe analyze the dense connections of DenseNet in Section 3.1, then we model the layer-wise pruning as a Markov decision process (MDP)(Puterman, 2014) and design a Long-short term memory( LSTM)(Hochreiter & Schmidhuber, 1997) controller to generate inference paths in Section 3.2. The interaction between the agent (i.e., the LSTM controller) and the environment (i.e., the DenseNet) is described in Figure 2. The reward shaping technique in our method is introduced in Section 3.3. Finally, we show the complete training process of LWP in Section 3.4. 3.1 PRETRAINED DENSE CONVOLUTIONAL NETWORKS\nVanilla DenseNet consists of four parts: the first convolution layer, multiple dense blocks, transition layers and finally the fully-connected layer. 
The first convolution layer is only for feature extraction from raw data. As for the multiple dense blocks, each dense block consists of multiple layers. The transition layers are used as down-sampling layers to change the size of feature maps and the last fully-connected layer is used for image classification. Obviously, the dense connections are mainly reflected in the dense blocks. Therefore, we study the connection policy for dense layers in this paper. 3.2 GENERATE INFERENCE PATHS WITH AN LSTM CONTROLLER\nSuppose the DenseNet has L layers; the controller needs to make K (equal to the number of layers in dense blocks) decisions. For layer i, we specify the number of previous layers to be connected in the range between 0 and $n_i$ ($n_i = i$). All possible connections among the DenseNet constitute the action space of the agent. However, the time complexity of traversing the action space is $O(\\prod_{i=1}^{K} 2^{n_i})$, which grows exponentially and is intractable for DenseNet (Huang et al., 2017b). Fortunately, reinforcement learning is good at solving sequential decision optimization problems and we model the network pruning as a Markov Decision Process (MDP). Since these hierarchical connections have time-series dependencies, it is natural to train an LSTM as the controller to address the above-mentioned issue.\nAt the first time step, the LSTM controller receives an empty embedding vector as the input that is regarded as the fixed state s of the agent, and the output of the previous time step is the input for the next time step. Each output neuron in the LSTM is equipped with $\\delta(x) = \\frac{1}{1+e^{-x}}$, so that the output $o_i$ defines a policy $p_{i,a_i}$ of keeping or dropping connections between the current layer and its previous layers as an $n_i$-dimensional Bernoulli distribution:\n$$o_i = f(s; \\theta_c), \\quad p_{i,a_i} = \\prod_{j=1}^{n_i} o_{ij}^{a_{ij}} (1 - o_{ij})^{(1 - a_{ij})}, \\quad (5)$$\nwhere f denotes the controller parameterized with $\\theta_c$. 
The j-th entry of the output vector $o_i$, denoted by $o_{ij} \\in [0, 1]$, represents the probability of the corresponding connection between the i-th layer and the j-th layer being kept. The action $a_i \\in \\{0, 1\\}^{n_i}$ is sampled from Bernoulli($o_i$). $a_{ij} = 1$ means keeping the connection, otherwise dropping it. There are $n_i$ connections in total for the i-th layer, but the output dimension of the LSTM at each time step is K. To unify the action space dimension and LSTM output dimension, we set both to K and apply a mask $\\in \\{0, 1\\}^K$ to the output of each time step, where the mask entries from the 1st to the $n_i$-th element are 1 and the others are 0. Finally, the probability distribution of the whole neural network architecture is formed as:\n$$\\pi(a_{1:K} | s; \\theta_c) = \\prod_{i=1}^{K} p_{i,a_i} \\quad (6)$$ 3.3 REWARD SHAPING\nReward shaping is introduced to help the controller make progress to an optimal solution. The reward function is designed for each sample and not only considers whether the prediction is correct, but also encourages less computation:\n$$R(a) = \\begin{cases} 1 - \\eta^{\\alpha} & \\text{if the prediction is correct} \\\\ -\\gamma & \\text{otherwise} \\end{cases} \\quad (7)$$\nwhere $\\eta = \\frac{SUBFLOPs}{FLOPs}$ measures the percentage of floating-point operations utilized. SUBFLOPs and FLOPs represent the floating-point operations of the child network and vanilla DenseNet, respectively. In order to maximize the reward, the prediction needs to be correct and SUBFLOPs should be reduced as much as possible. The trade-off between performance and complexity is mainly controlled by $\\alpha$ and $\\gamma$, and more details are discussed in Section 7.3 of the appendix. 
3.4 TRAINING WITH ADVANTAGE ACTOR-CRITIC\nAfter obtaining the feedback from the child network, we modify Eq (1) as the following expected reward:\n$$J(\\theta_c) = \\mathbb{E}_{a \\sim \\pi_{\\theta_c}} [r(s, a)] \\quad (8)$$\nTo maximize Eq (8) and accelerate policy gradient training over $\\theta_c$, we utilize the advantage actor-critic (A2C) with an estimation of the state value function $V(s; \\theta_v)$ to derive the gradients of $J(\\theta_c)$ as:\n$$\\nabla_{\\theta_c} J(\\theta_c) = \\sum_{a} (r(s, a) - V(s; \\theta_v)) \\pi(a|s, \\theta_c) \\nabla_{\\theta_c} \\log \\pi(a|s, \\theta_c) \\quad (9)$$\nEq (9) can be approximated by using the Monte Carlo sampling method:\n$$\\nabla_{\\theta_c} J(\\theta_c) = \\frac{1}{n} \\sum_{t=1}^{n} (r^{(t)}(s, a) - V(s; \\theta_v)) \\nabla_{\\theta_c} \\log \\pi(a|s, \\theta_c) \\quad (10)$$\nwhere n is the batch size. The mini-batch samples share the same child network and perform forward propagation in parallel. Therefore, they have the same policy distribution $\\pi(a|s, \\theta_c)$ but different $r(s, a)$. We further improve exploration to prevent the policy from converging to a suboptimal deterministic policy by adding the entropy of the policy, $H(\\pi(a|s, \\theta_c))$, to the objective function. The gradient of the full objective function takes the form:\n$$\\nabla_{\\theta_c} J(\\theta_c) = \\frac{1}{n} \\sum_{t=1}^{n} [(r^{(t)}(s, a) - V(s; \\theta_v)) \\nabla_{\\theta_c} \\log \\pi(a|s, \\theta_c) + \\beta \\nabla_{\\theta_c} H(\\pi(a|s, \\theta_c))] \\quad (11)$$\nAs for the value network, we define the loss function as $L_v$ and utilize gradient descent methods to update $\\theta_v$:\n$$L_v = \\frac{1}{n} \\sum_{t=1}^{n} (r^{(t)}(s, a) - V(s; \\theta_v))^2, \\quad \\nabla_{\\theta_v} L_v = -\\frac{2}{n} \\sum_{t=1}^{n} (r^{(t)}(s, a) - V(s; \\theta_v)) \\frac{\\partial V(s; \\theta_v)}{\\partial \\theta_v} \\quad (12)$$\nThe entire training procedure is divided into three stages: curriculum learning, joint training and training from scratch. Algorithm 1 shows the complete recipe for layer-wise pruning.\nCurriculum learning. 
It is easy to note that the search space scales exponentially with the block layers of DenseNet and there are $\\prod_{i=1}^{K} 2^{n_i}$ keeping/dropping configurations in total. We use curriculum learning (Bengio, 2013) like BlockDrop (Wu et al., 2018) to solve the problem that policy gradient is sensitive to initialization. For epoch t (1 \u2264 t < K), the LSTM controller only learns the policy of the last t layers and keeps the policy of the remaining K \u2212 t layers consistent with the vanilla DenseNet. When t \u2265 K, all block layers are involved in the decision making process.\nJoint training. The previous stage just updates parameters \u03b8c and \u03b8v. The controller learns to identify connections between two block layers to be kept or dropped. However, updating only these parameters prevents the agent from learning the optimal architecture. Jointly training the DenseNet and controller can be employed as the next stage so that the controller guides the gradients of \u03b8v to the direction of dropping more connections.\nTraining from scratch. After joint training, several child networks can be sampled from the policy distribution \u03c0(a|s, \u03b8c) and we select the child network with the highest reward to train from scratch, which produces better experimental results.\nWe summarize the entire process in Algorithm 1. 5 EXPERIMENT\nWe evaluate the LWP method on three image classification benchmarks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009) and ImageNet 2012 (Deng et al., 2009). Details of the experiments and hyperparameter settings are given in Appendix 7.3. 5.1 RESULTS ON CIFAR\nPretrained DenseNet. For CIFAR datasets, DenseNet-40-12 and DenseNet-100-12 are selected as the backbone CNN. During the training time, the backbone CNN needs to make predictions with dynamic computation paths. 
In order to make the backbone CNN adjust to our algorithm strategy, we reproduced the DenseNet-40-12 and DenseNet-100-12 on CIFAR based on Pytorch (Paszke et al., 2017) and the results are shown in Table 1.\nComparisons and analysis. The results on CIFAR are reported in Table 1. For CIFAR-10 dataset and the vanilla DenseNet-40-12, our method has reduced the amounts of FLOPs, parameters by nearly 81.4%, 78.2%, respectively and the test error only increase 1.58%. The exponential power \u03b1 and penalty \u03b3 can be tuned to improve the performance. In this experiment, we just modify hyperparameter \u03b1 from 2 to 3 so that the model\ncomplexity(105M vs 173M FLOPs) is increased while test error rate is reduced to 6.00%.The same law can be observed on the DenseNet-100-12 with LWP. Our algorithm also has advantages on Condensenet (Huang et al., 2018) which needs more expert knowledge and NAS (Zoph & Le, 2016) which takes much search time complexity and needs more parameters but gets higher test error.\nWe can also observe the results on CIFAR-100 from the Table 1 that the amounts of FLOPs in DenseNet with LWP are just nearly 46.5%, 66.3% of the DenseNet-40-12 and DenseNet-100-12. The compression rates are worse than that for CIFAR-10. This may be caused by the complexity of the CIFAR-100 classification task. The more hard task, the more computation is needed. 5.2 RESULTS ON IMAGENET\nPretrained DenseNet. We compress the DenseNet-121-32 which has four dense blocks([6, 12, 24, 16]) on ImageNet. The growth rate of DenseNet-121-32 is 32 and this neural network architecture is equipped with bottleneck layers and compression ratio fixed at 0.5 that are designed to improve the model compactness. In the following section, we prove that the model can be further compressed. This model is initialized by loading the checkpoint file of pretrained model from Pytorch.\nMake comparisons and analysis. 
Although the bottleneck layer and compression ratio are introduced in DenseNet-121-32, the result shows that there is still much redundancy. As observed from Table 2, we can still reduce 54.7% FLOPs and 35.2% parameters of the vanilla DenseNet-121-32 with 1.84% top-1 and 1.28% top-5 test error increasing. 5.3 QUANTITATIVE RESULTS\nIn this section, we argue that our proposed methods can learn more compact neural network architecture by analyzing the number of input channel in DenseNet layer and the connection dependency between a convolution layer with its preceding layers.\nIn Figure 3 left, the red bar represent the number of input channel in DenseNet-40-12-LWP (D40-12-LWP) and the blue bar represent the number of input channel in vanilla DenseNet. We can observe that the number of input channels grows linearly with the layer index because of the concatenation operation and D40-12-LWP has layer-wise input channels identified by the controller automatically. The input channel is 0 means this layer is dropped so that the block layers is reduced from 36 to 26. The number of connections between a layer with its preceding layers can be obtained from the right panel of Figure 3. In Figure 3 right, the x, y axis define the target layer t and source layer s. The small square at position (s, t) represents the connection dependency of target layer t on source layer s. The pixel value of position (s, t) is evaluated with the average absolute filter weights of convolution layers in D40-12-LWP. One small square means one connection and the number of small squares in the vertical direction indicates the number of connections to target layer t.\nAs reported by the paper DenseNet(Huang et al., 2017b), there are redundant connections because of the low kernel weights on average between some layers. 
The right panel of Figure 3 shows that the values of the small squares connecting to the same target layer t are almost equal, which means that layer t has almost the same dependency on each of its preceding layers. This indicates that the child network learned from the vanilla DenseNet is compact and efficient. 6 CONCLUSION\nWe propose a strategy for searching for an efficient child network of DenseNet with a reinforcement learning agent. An LSTM is used as the controller to prune redundant connections layer by layer. The whole process is divided into three stages: curriculum learning, joint training and training from scratch. Extensive experiments on CIFAR and ImageNet show the effectiveness of our method. An analysis of the child network and the filter parameters in every convolution layer indicates that our proposed method can learn to find compact and efficient neural network architectures. 7 APPENDIX\n 7.1 DATASETS AND EVALUATION METRICS\nCIFAR-10 and CIFAR-100 consist of images from 10 and 100 classes, respectively, each with 32\u00d732 RGB pixels. Both datasets contain 60,000 images, of which 50,000 are used for training and 10,000 for testing. We use standard data pre-processing and augmentation techniques. The complete procedure is: normalize the data using the channel means and standard deviations, centrally pad the training images with size 4, randomly crop to restore 32\u00d732 images and randomly flip horizontally with probability 0.5. The evaluation metric on CIFAR is the prediction accuracy.\nImageNet contains a total of 1.33 million color images in 1000 visual classes, of which 1.28 million are training images and 50k are validation images. We also adopt a data-augmentation scheme for pre-processing, i.e., resize the images to 256\u00d7256, normalize the images using channel means and standard deviations, randomly crop to 224\u00d7224 and flip horizontally at training time, but apply a center crop of size 224\u00d7224 at test time. 
The performance on ImageNet is measured by both top-1 and top-5 prediction accuracy. 7.2 TRAINING CONFIGURATIONS\nTraining configurations for CIFAR. Based on the pretrained DenseNet, the LSTM controller is trained with batch size 128 for 1000 epochs during the curriculum learning procedure, and the ADAM optimizer without weight decay is adopted. The learning rate starts from 1e\u22123 and is divided by 10 at epochs 500 and 750. For the joint training, we fix the learning rate at 1e\u22124 and finetune the model for 1000 epochs. Then we select the child network with the highest reward to train from scratch. In the last stage, the SGD optimizer with a weight decay of 1e\u22124 and a Nesterov momentum of 0.9 without dampening is adopted. We train the selected child network with mini-batch size 64 and a cosine-shaped learning rate schedule from 0.1 to 0 for 300 epochs.\nTraining configurations for ImageNet. For curriculum learning and joint training, we set the number of epochs to 90 and 50, respectively, and the batch size to 1024. In the curriculum learning procedure, the learning rate is set to 1e\u22123 and is divided by 10 at epochs 45 and 75. The learning rate is fixed at 1e\u22124 for the joint training procedure. We use the same optimizer parameters as in the CIFAR experiments. Finally, the learned child network is optimized in the same way as DenseNet (Huang et al., 2017b). 7.3 HYPERPARAMETERS SEARCH\nWe use a reward shaping technique in our model, and the detailed reward formulation is defined in Eq (7). The trade-off between model performance and complexity can be controlled by adjusting the reward function. As shown in Eq (7), the reward function mainly depends on the exponential power \u03b1 of the FLOPs ratio and the penalty \u2212\u03b3. In the following, we analyze these two factors for the child network (D40-12-LWP) based on DenseNet-40-12 (D40-12) and the CIFAR-10 dataset.\nExponential power. 
Given a policy, we pass an image to the child network and expect a higher reward if the prediction is correct. For a fixed \u03b1, the lower the FLOPs of the child network, the larger the reward value. Conversely, to obtain the same reward value, a larger exponential power \u03b1 corresponds to a larger model complexity. As shown in Figure 4 (a) (b), when the exponential power is set to \u03b1 = 1/3, 1/2, 1, 2, 3 with \u03b3 fixed at 0.5, the classification accuracy and FLOPs of the child network both increase gradually.\nPenalty. When the prediction is incorrect, the penalty \u2212\u03b3 is given as the feedback. A larger \u03b3 means the controller places more emphasis on model performance than on reducing model complexity. \u03b3 is set to 0.25, 0.5, 0.75 and 1 with the exponential power \u03b1 fixed at 2. The results are shown in Figure 4 (c) (d), and we can observe that both curves first increase and then decrease. 7.4 ALGORITHM\nAlgorithm 1 The pseudo-code for layer-wise pruning. Input: Training dataset Dt; Validation dataset Dv; Pretrained DenseNet.\nInitialize the parameters \u03b8c of the LSTM controller and \u03b8v of the value network randomly. 
Set epochs for curriculum learning, joint training and training from scratch to Mcl, Mjt and Mfs respectively and sample Z child networks.\nOutput: The optimal child network\n// Curriculum learning\nfor t = 1 to Mcl do\n  o = f(s; \u03b8c)\n  if t < K \u2212 t then\n    for i = 1 to K \u2212 t do\n      o[i, 0 : i] = 1\n      o[i, i :] = 0\n    end for\n  end if\n  Sample a from Bernoulli(o)\n  DenseNet with policy makes predictions on the training dataset Dt\n  Calculate feedback R(a) with Eq (7)\n  Update parameters \u03b8c and \u03b8v with Eq (11) and Eq (12) respectively\nend for\n// Joint training\nfor t = 1 to Mjt do\n  Simultaneously train DenseNet and the controller\nend for\nfor t = 1 to Z do\n  Sample a child network from \u03c0(a|s, \u03b8c)\n  Execute the child network on the validation dataset Dv\n  Obtain feedback R^(t)(a) with Eq (7)\nend for\nSelect the child network N with highest reward\n// Training from scratch\nfor t = 1 to Mfs do\n  Train the child network N from scratch\nend for\nreturn The optimal child network N", "source": "paper_review", "evaluation": "LLM"}
{"instructions": ["Please generate the Abstract section for this paper.", "Please generate the Related Work section for this paper, and you have to cite these papers:\n1.  Title: Mixture density networks , Authors: Christopher M Bishop , 1994\n2.  Title: Weight uncertainty in neural networks , Authors: Charles Blundell Julien Cornebise... , 2015\n3.  Title: A weakly informative default prior distribution for logistic and other regression models , Authors: Andrew Gelman Aleks Jakulin... , 2008\n4.  Title: Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper) , Authors: Andrew Gelman , 2006\n5.  Title: Variational dropout and the local reparameterization trick , Authors: Durk P Kingma Tim Salimans... , 2015\n6.  Title: Simple and scalable predictive uncertainty estimation using deep ensembles , Authors: Balaji Lakshminarayanan Alexander Pritzel... , 2017\n7.  Title: Variational dropout sparsifies deep neural networks , Authors: Dmitry Molchanov Arsenii Ashukha... , 2017\n8.  Title: Evidential deep learning to quantify classification uncertainty , Authors: Murat Sensoy Lance Kaplan... , 2018\n", "Please write a review for this paper and you should provide some suggestions and raise some questions in your review."], "outputs": ["Deterministic neural networks (NNs) are increasingly being deployed in safety critical domains, where calibrated, robust and efficient measures of uncertainty are crucial. While it is possible to train regression networks to output the parameters of a probability distribution by maximizing a Gaussian likelihood function, the resulting model remains oblivious to the underlying confidence of its predictions. In this paper, we propose a novel method for training deterministic NNs to not only estimate the desired target but also the associated evidence in support of that target. 
We accomplish this by placing evidential priors over our original Gaussian likelihood function and training our NN to infer the hyperparameters of our evidential distribution. We impose priors during training such that the model is penalized when its predicted evidence is not aligned with the correct output. Thus the model estimates not only the probabilistic mean and variance of our target but also the underlying uncertainty associated with each of those parameters. We observe that our evidential regression method learns well-calibrated measures of uncertainty on various benchmarks, scales to complex computer vision tasks, and is robust to adversarial input perturbations.", "Uncertainty estimation has a long history in neural networks, from modeling probability distribution parameters over outputs (Bishop, 1994) to Bayesian deep learning (Kendall & Gal, 2017). Our work builds on this foundation and presents a scalable representation for inferring the parameters of an evidential uncertainty distribution while simultaneously learning regression tasks via MLE. In Bayesian deep learning, priors are placed over network weights and estimated using variational inference (Kingma et al., 2015). Dropout (Gal & Ghahramani, 2016; Molchanov et al., 2017) and BBB (Blundell et al., 2015) rely on multiple samples to estimate predictive variance. Ensembles (Lakshminarayanan et al., 2017) provide a tangential approach where sampling occurs over multiple trained instances. In contrast, we place uncertainty priors over the likelihood function and thus only need a single forward pass to evaluate both prediction and uncertainty. Additionally, our approach of uncertainty estimation proved to be better calibrated and capable of predicting where the model fails. A large topic of research in Bayesian inference focuses on placing prior distributions over hierarchical models to estimate uncertainty (Gelman et al., 2006; 2008). 
Our methodology falls under the class of evidential deep learning which models higher-order distribution priors over neural network predictions to interpret uncertainty. Prior works in this field (Sensoy et al., 2018; Malinin & Gales, 2018) have focused exclusively on modeling uncertainty in the classification domain with Dirichlet prior distributions. Our work extends this field into the broad range of regression learning tasks (e.g. depth estimation, forecasting, robotic control learning, etc.) and demonstrates generalizability to out-of-distribution test samples and complex learning problems.", "This paper proposes a novel approach to estimate the confidence of predictions in a regression setting. The approach starts from the standard modelling assuming iid samples from a Gaussian distribution with unknown mean and variances and places evidential priors. This opens the door to online applications with fully integrated uncertainty estimates. \n\nPros:\n1.\tNovel approach to regression (a similar work has been published at NeurIPS last year for classification [3]), but the extension of the work to regression is important.\n2.\tThe experimental results show consistent improvement in performance over a wide base of benchmarks, scales to large vision problems and behaves robustly against adversarial examples.\n3.\tThe presentation of the paper is overall nice, and the Figures are very useful to the general comprehension of the article.\nCons:\n1.\tThe theory of evidence, which is not widely known in the ML community, is not clearly introduced. \nI think that the authors should consider adding a section similar to Section 3 of Sensoy et al. [3] should be considered. Currently, the only step explaining the evidential approach that I found was in section 3.1, in a very small paragraph (between \u201cthe mean of [\u2026] to \\lambda + 2\\alpha.\u201d). 
I believe that the article would greatly benefit from a more thorough introduction of concepts linked to the theory of evidence.\n2.\tThe authors briefly mention that KL is not well defined between some NIG distributions (p.5) and propose a custom evidence regularizer, but there\u2019s very little insight given on how this connects to/departs from the ELBO approach. \n\nOther comments/questions:\n1.\t(p.1)  I\u2019m not sure to fully understand what\u2019s meant by higher-order/lower-order distributions, could you clarify?\n2.\t(p.3) In section 3.1, the term in the total evidence \\phi_j is not defined.\n3.\t(p.3) Could you comment on the implications of assuming that the estimated distribution can be factorized? \n4.\t(p.4) Could you comment on the difference that there is between NLL_ML and NLL_SOS from a modelling perspective?\n5.\t(p.4) The ELBO loss (6) is unclearly defined, and not connected to the direct context. I would suggest moving this to the section 3.3, where the prior p(\\theta) used in eq. (6) is actually defined.\n\nSpelling & typos:\n-\t(p.4) There are several typos in equation (8), where tau should be replaced with 1/\\sigma^2. \n-\t(p.8) In the last sentence, there is \u201cntwork\u201d instead of network.\n-\t(p.9) There is a typo in the name of J\u00f8sang in the references. \n-\t(p.10) In equation (13), due to the change of variable, there should be a \n-(1/\\tau^2) added;  \n-\t(p.10) In equation (14), the \\exp(-\\lambda*\\pi*(\u2026)) should be replaced with \\exp(-\\lambda*\\tau*(\u2026)). \n\n6: Weak Accept"], "input": "2 MODELLING UNCERTAINTIES FROM DATA\n 2.1 PRELIMINARIES\nConsider the following supervised optimization problem: given a dataset, D, of N paired training examples, (x1, y1), . . . , (xN , yN ), we aim to learn a function f , parameterized by a set of weights, w, which approximately solves the following optimization problem:\nmin w\nJ(w); J(w) = 1\nN N\u2211 i=1 Li(w), (1)\nwhere Li(\u00b7) describes a loss function. 
In this work, we consider deterministic regression problems, which commonly optimize the sum of squared errors, Li(w) = 12 \u2016yi \u2212 f(xi;w)\u2016\n2. In doing so, the model is encouraged to learn the average correct answer for a given input, but does not explicitly model any underlying noise or uncertainty in the data when making its estimation. 2.2 MAXIMUM LIKELIHOOD ESTIMATION\nWe can also approach our optimization problem from a maximum likelihood perspective, where we learn model parameters that maximize the likelihood of observing a particular set of training data. In the context of deterministic regression, we assume our targets, yi, were drawn i.i.d. from a Gaussian distribution with mean and variance parameters \u03b8 = (\u00b5, \u03c32). In maximum likelihood estimation, we aim to learn a model to infer \u03b8 = (\u00b5, \u03c32) that maximize the likelihood of observing our targets, y, given by p(yi|\u03b8). In practice, we minimize the negative log likelihood by setting:\nLi(w) = \u2212 log p(yi|\u00b5, \u03c32\ufe38 \ufe37\ufe37 \ufe38 \u03b8 ) = 1 2 log(2\u03c0\u03c32) + (yi \u2212 \u00b5)2 2\u03c32 . (2)\nIn learning the parameters \u03b8, this likelihood function allows us to successfully model the uncertainty of our data, also known as the aleatoric uncertainty. However, our model remains oblivious to the predictive model or epistemic uncertainty (Kendall & Gal, 2017).\nIn this paper, we present a novel approach for estimating the evidence in support of network predictions by directly learning both the inferred aleatoric uncertainty as well as the underlying epistemic uncertainty over its predictions. We achieve this by placing higher-order prior distributions over the learned parameters governing the distribution from which our observations are drawn. 3 EVIDENTIAL UNCERTAINTY FOR REGRESSION\n 3.1 PROBLEM SETUP\nWe consider the problem where our observed targets, yi, are drawn i.i.d. 
from a Gaussian distribution now with unknown mean and variance (\u00b5, \u03c32), which we seek to probabilistically estimate. We model this by placing a conjugate prior distribution on (\u00b5, \u03c32). If we assume our observations are drawn from a Gaussian, this leads to placing a Gaussian prior on our unknown mean and an Inverse-Gamma prior on our unknown variance:\n(y1, . . . , yN ) \u223c N (\u00b5, \u03c32) \u00b5 \u223c N (\u03b3, \u03c32\u03bb\u22121) \u03c32 \u223c \u0393\u22121(\u03b1, \u03b2).\nwhere \u0393(\u00b7) is the gamma function, m = (\u03b3, \u03bb, \u03b1, \u03b2), and \u03b3 \u2208 R, \u03bb > 0, \u03b1 > 0, \u03b2 > 0. Our aim is to estimate a posterior distribution q(\u00b5, \u03c32) = p(\u00b5, \u03c32|y1, . . . , yN ). To obtain an approximation for the true posterior, we assume that the estimated distribution can be factorized (Parisi, 1988) such that q(\u00b5, \u03c32) = q(\u00b5) q(\u03c32). Thus, our approximation takes the form of the Gaussian conjugate prior, the Normal Inverse-Gamma (N.I.G.) distribution:\np(\u00b5, \u03c32\ufe38 \ufe37\ufe37 \ufe38 \u03b8 | \u03b3, \u03bb, \u03b1, \u03b2\ufe38 \ufe37\ufe37 \ufe38 m\n) = \u03b2\u03b1 \u221a \u03bb\n\u0393(\u03b1) \u221a 2\u03c0\u03c32\n( 1\n\u03c32\n)\u03b1+1 exp { \u22122\u03b2 + \u03bb(\u03b3 \u2212 \u00b5) 2\n2\u03c32\n} . (3)\nA popular interpretation of the parameters of the conjugate prior distribution is in terms of \u201cvirtualobservations\u201d in support of a given property (Jordan, 2009). For example, the mean of a N.I.G. distribution can be interpreted as being estimated from \u03bb virtual-observations with sample mean \u03b3 while its variance was estimated from 2\u03b1 virtual-observations with sample mean \u03b3 and sum of squared deviations 2\u03b2. 
Following from this interpretation, we define the total evidence, \u03a6, of our evidential distributions as the sum of all inferred virtual-observations counts: (\u03a6 = \u03bb+ 2\u03b1).\nDrawing a sample \u03b8j from the N.I.G. distribution yields a single instance of our likelihood function, namely N (\u00b5j , \u03c32j ). Thus, the N.I.G. hyperparameters, (\u03b3, \u03bb, \u03b1, \u03b2), determine not only the location but also the dispersion concentrations, or uncertainty, associated with our inferred likelihood function. Therefore, we can interpret the N.I.G. distribution as higher-order, evidential, distribution on top of the unknown lower-order likelihood distribution from which observations are drawn.\nFor example, in Fig. 2A we visualize different evidential N.I.G. distributions with varying model parameters. We illustrate that by increasing the evidential parameters (i.e. \u03bb, \u03b1) of this distribution, the p.d.f. becomes tightly concentrated about its inferred likelihood function. Considering a single parameter realization of this higher-order distribution, cf. Fig. 2B, we can subsequently sample many lower-order realizations of our likelihood function, as shown in Fig. 2C.\nIn this work, we use neural networks to infer the hyperparameters of this higher-order, evidential distribution, given an input. This approach presents several distinct advantages compared to prior work. First, our method enables simultaneous learning of the desired regression task, along with aleatoric and epistemic uncertainty estimation, built in, by enforcing evidential priors. Second, since the evidential prior is a higher-order N.I.G. distribution, the maximum likelihood Gaussian can be computed analytically from the expected values of the (\u00b5, \u03c32) parameters, without the need for sampling. 
Third, we can effectively estimate the epistemic or model uncertainty associated with the network\u2019s prediction by simply evaluating the variance of our inferred evidential distribution. 3.2 LEARNING THE EVIDENTIAL DISTRIBUTION\nHaving formalized the use of an evidential distribution to capture both aleatoric and epistemic uncertainty, we next describe our approach for learning a model (c.f. Fig. 2D) to output the hyperparameters\nof this distribution. For clarity, we will structure the learning objective into two distinct parts: (1) acquiring or maximizing model evidence in support of our observations and (2) minimizing evidence or inflating uncertainty when the prediction is wrong. At a high level, we can think of (1) as a way of fitting our data to the evidential model while (2) enforces a prior to inflate our uncertainty estimates.\n(1) Maximizing the model fit.\nFrom Bayesian probability theory, the \u201cmodel evidence\u201d, or marginal likelihood, is defined as the likelihood of an observation, yi, given the evidential distribution parameters m and is computed by marginalizing over the likelihood parameters \u03b8:\np(yi|m) = p(yi|\u03b8,m)p(\u03b8|m)\np(\u03b8|yi,m) = \u222b \u03b8 p(yi|\u03b8,m)p(\u03b8|m) d\u03b8. (4)\nThe model evidence is not, in general, straightforward to evaluate since computing it involves integrating out the dependence on latent model parameters:\np(yi|m) = \u222b \u221e \u03c32=0 \u222b \u221e \u00b5=\u2212\u221e p(yi|\u00b5, \u03c32)p(\u00b5, \u03c32|m) d\u00b5d\u03c32 (5)\nHowever, by placing a N.I.G. evidential prior on our Gaussian likelihood function an analytical solution for the model evidence does exist. For computational reasons, we minimize the negative logarithm of the model evidence (LNLLi (w)). For a complete derivation please refer to Sec. 
7.1,\nLNLLi (w) = \u2212 log p(yi|m) = \u2212 log\n( 2 1 2+\u03b1\u03b2\u03b1 \u221a \u03bb\n2\u03c0(1 + \u03bb)\n( 2\u03b2 + \u03bb(\u03b3 \u2212 yi)2\n1 + \u03bb\n)\u2212 12\u2212\u03b1) . (6)\nInstead of modeling this loss using empirical Bayes, where the objective is to maximize model evidence, we alternatively can minimize the sum-of-squared (SOS) errors, between the evidential prior and the data that would be sampled from the associated likelihood. Thus, we define LSOSi (w) as\nLSOSi (w) = E\u03b8\u2032\u223cp(\u03b8|m) [ Ey\u2032\u223cp(y|\u03b8\u2032) [ ||y\u2032 \u2212 yi||22 ]] (7)\n= \u222b \u221e \u03c32=0 \u222b \u221e \u00b5=\u2212\u221e Ey\u2032\u223cp(y|\u00b5,\u03c32) [ ||y\u2032 \u2212 yi||22 ] p(\u00b5, \u03c32|m) d\u00b5d\u03c32 (8)\n=\n( \u0393(\u03b1\u2212 12 )\n4 \u0393(\u03b1)\u03bb \u221a \u03b2\n)( 2\u03b2(1 + \u03bb) + (2\u03b1\u2212 1)\u03bb(yi \u2212 \u03b3)2 ) . (9)\nA step-by-step derivation is given in Sec. 7.1. In our experiments, using LSOSi (w) resulted in greater training stability and increased performance, compared to the LNLLi (w) loss. Therefore, LSOSi (w) is used in all presented results.\n(2) Minimizing evidence on errors.\nIn the first term of our objective above, we outlined a loss function for training a NN to output parameters of a N.I.G. distribution to fit our observations, either by maximizing the model evidence or minimizing the sum-of-squared errors. Now, we describe how to regularize training by applying a lack of evidence prior (i.e., maximum uncertainty). Therefore, during training we aim to minimize our evidence (or maximize our uncertainty) everywhere except where we have training data.\nThis can be done by minimizing the KL-divergence between the inferred posterior, q(\u03b8), and a prior, p(\u03b8). This has been demonstrated with success in the categorical setting where the uncertainty prior can be set to a uniform Dirichlet (Malinin & Gales, 2018; Sensoy et al., 2018). 
In the regression setting, the KL-divergence between our posterior and a N.I.G. zero evidence prior (i.e., {\u03b1, \u03bb} = 0) is not well defined (Soch & Allefeld, 2016), please refer to Sec. 7.2 for a derivation. Furthermore, this prior needs to be enforced specifically where there is no support from the data. Past works in classification accomplish this by using the ground truth likelihoood classification (i.e., the one-hot encoded labels) to remove the non-misleading evidence. However, in regression, labels are provided as point targets (not ground truth Gaussian likelihoods). Unlike classification, it is not possible to penalize evidence everywhere except our single point estimate, as this space is infinite and unbounded. Thus, these previously explored approaches for evidential optimization are not directly applicable.\nTo address both of these shortcomings of past works, now in the regression setting, we formulate a novel evidence regularizer, LRi , based on the error of the i-th prediction,\nLRi (w) = \u2016yi \u2212 E[\u00b5i]\u2016p \u00b7 \u03a6 = \u2016yi \u2212 \u03b3\u2016p \u00b7 (2\u03b1+ \u03bb), (10)\nwhere \u2016x\u2016p represents the L-p norm of x. The value of p impacts the penalty imposed on the evidence when a wrong prediction is made. For example, p = 2, heavily over-penalizes the evidence on larger errors, whereas p = 1 and p = 0.5 saturate the evidence penalty for larger errors. We found that p = 1 provided the optimal stability during training and use this value in all presented results.\nThis regularization loss imposes a penalty whenever there is an error in the prediction that scales with the total evidence of our inferred posterior. Conversely, large amounts of predicted evidence will not be penalized as long as the prediction is close to the target observation. 
We provide an ablation analysis to quantitatively demonstrate the added value of this evidential regularizer in Sec 7.3.2.\nThe combined loss function employed during training consists of the two loss terms for maximizing model evidence and regularizing evidence,\nLi(w) = LSOSi (w) + LRi (w). (11) 3.3 EVALUATING ALEATORIC AND EPISTEMIC UNCERTAINTY\nThe aleatoric uncertainty, also referred to as statistical or data uncertainty, is representative of unknowns that differ each time we run the same experiment. We evaluate the aleatoric uncertainty from E[\u03c32] = \u03b2\u03b1\u22121 . The epistemic, also known as the model uncertainty, describes the estimated uncertainty in the learned model and is defined as Var[\u00b5] = \u03b2(\u03b1\u22121)\u03bb . Note that Var[\u00b5] = E[\u03c3\n2]/\u03bb, which is expected as \u03bb is one of our two evidential virtual-observation counts. 4 EXPERIMENTS\n 4.1 PREDICTIVE ACCURACY AND UNCERTAINTY BENCHMARKING\nWe first qualitatively compare the performance of our approach against a set of benchmarks on a one-dimensional toy regression dataset (Fig. 3). For training and dataset details please refer to Sec. 7.3.1. We compare deterministic regression, as well as techniques using empirical variance of the networks\u2019 predictions such as MC-dropout, model-ensembles, and Bayes-byBackprop which underestimate the uncertainty outside the training distribution. In contrast, evidential regression estimates uncertainty appropriately and grows the uncertainty estimate with increasing distance from the training data.\nAdditionally, we compare our approach to stateof-the-art methods for predictive uncertainty estimation using NNs on common real world datasets used in (Hern\u00e1ndez-Lobato & Adams, 2015; Lakshminarayanan et al., 2017; Gal &\nGhahramani, 2016). We evaluate our proposed evidential regression method against model-ensembles and BBB based on root mean squared error (RMSE), and negative log-likelihood (NLL). 
We do not provide results for MC-dropout since it consistently performed inferior to the other baselines. The results in Table 1 indicate that although the loss function for evidential regression is more complex than competing approaches, it is the top performer in RMSE and NLL in 8 out of 9 datasets.\nFurthermore, we demonstrate that, on a synthetic dataset with a priori known noise, evidential models can additionally estimate and recover the underlying aleatoric uncertainty. For more information please refer to Sec. 7.3.3 for results and experiment details. 4.2 DEPTH ESTIMATION\nAfter establishing benchmark comparison results, in this subsection we demonstrate the scalability of our evidential learning by extending to the complex, high-dimensional task of depth estimation. Monocular end-to-end depth estimation is a central problem in computer vision which aims to learn a representation of depth directly from an RGB image of the scene. This is a challenging learning task since the output target y is very high-dimensional. For every pixel in the image, we regress over the desired depth and simultaneously estimate the uncertainty associated to that individual pixel.\nOur training data consists of over 27k RGB-to-depth pairs of indoor scenes (e.g. kitchen, bedroom, etc.) from the NYU Depth v2 dataset (Nathan Silberman & Fergus, 2012). We train a U-Net style NN (Ronneberger et al., 2015) for inference. The final layer of our model outputs a single H \u00d7W activation map in the case of deterministic regression, dropout, ensembling and BBB. Evidential models output four final activation maps, corresponding to (\u03b3, \u03bb, \u03b1, \u03b2).\nTable 2 summarizes the size and speed of all models. Evidential models contain significantly fewer trainable parameters than ensembles (where the number of parameters scales linearly with the size of the ensemble). 
BBB maintains a trainable mean and variance for every weight in the network, so its size is roughly 2\u00d7 larger as well. Since evidential regression models do not require sampling in order to estimate their uncertainty, their forward-pass inference times are also significantly more efficient. Finally, we demonstrate comparable predictive accuracy (through RMSE and NLL) to the other models. For a more detailed breakdown of how the number of samples effects the baselines please refer to Tab. 3. Note that the output size of the depth estimation problem presented significant learning challenges for the BBB baseline, and it was unable to converge during training. As a result, for the remainder of this analysis we compare against only spatial dropout and ensembles.\nWe evaluate these models in terms of their accuracy and their predictive uncertainty on unseen test data. Fig. 4A-C visualizes the predicted depth, absolute error from ground truth, and predictive uncertainty across three randomly picked test images. Ideally, a strong predictive uncertainty would capture any errors in the prediction (i.e., roughly correspond to where the model is making errors). Compared to dropout and ensembling, evidential uncertainty modeling captures the depth errors while providing clear and localized predictions of confidence. In general, dropout drastically underestimates the amount of uncertainty present, while ensembling occasionally overestimates the uncertainty.\nTo evaluate uncertainty calibration to the ground-truth errors, we fit receiver operating characteristic (ROC) curves to normalized estimates of error and uncertainty. Thus, we test the network\u2019s ability to detect how likely it is to make an error at a given pixel using its predictive uncertainty.\nROC curves take into account sensitivity and specificity of the uncertainties towards error predictions and are stronger if they contain greater area under their curve (AUC). Fig. 
4D demonstrates that our evidential model provides uncertainty estimates that concentrate where the model is making errors.\nIn addition to epistemic uncertainty, we also evaluate the aleatoric uncertainty estimates that are learned from our evidential models. Fig. 5 compares the evidential aleatoric uncertainty to that obtained by Gaussian likelihood optimization in several domains with high data uncertainty (mirror reflections and poor illumination). The results of both methods are in strong agreement, identifying mirror reflections and dark regions without visible geometry as sources of high uncertainty. 4.3 OUT-OF-DISTRIBUTION TESTING\nA key use of uncertainty estimation is to understand when a model is faced with test samples that fall out-of-distribution (OOD) or when the model\u2019s output cannot be trusted. In the previous subsection, we showed that our evidential uncertainties were well calibrated with the model\u2019s errors. In this subsection, we investigate the performance on out-of-distribution samples. Fig. 6 illustrates predicted depth on various test input images inside (left) and outside (right) of the original distribution. None of the images were seen by the model during training. We qualitatively and quantitatively demonstrate that the epistemic uncertainty predicted by our evidential model consistently increases on the OOD samples. 4.3.1 ROBUSTNESS TO ADVERSARIAL SAMPLES\nNext, we consider the extreme case of OOD detection where the inputs are adversarially perturbed to inflict maximum error on the model. We compute adversarial perturbations to our test set using the fast gradient sign method (Goodfellow et al., 2014), with increasing scales, \u03b5, of noise. Fig. 7A confirms that the absolute error of all methods increases as adversarial noise is added. We also observe a positive effect of noise on our predictive uncertainty estimates in Fig. 7B. 
An additional desirable property of evidential uncertainty modeling is that it exhibits a higher overall uncertainty when presented with adversarial inputs compared to dropout and ensembling methods. Furthermore, we observe this strong overall uncertainty estimation despite the model losing calibration accuracy from the adversarial examples (Fig. 7C).\nThe robustness of evidential uncertainty against adversarial perturbations is visualized in greater detail in Fig. 7D, which illustrates the predicted depth, error, and estimated pixel-wise uncertainty as we perturb the input image with greater amounts of noise (left-to-right). Note that the predictive uncertainty not only steadily increases as we increase the noise, but the spatial concentrations of uncertainty throughout the image maintain tight correspondence with the error. 6 CONCLUSION\nIn this paper, we develop a novel method for training deterministic NNs that both estimates a desired target and evaluates the evidence in support of the target to generate robust metrics of model uncertainty. We formalize this in terms of learning evidential distributions, and achieve stable training by penalizing our model for prediction errors that scale with the available evidence. Our approach for evidential regression is validated on a benchmark regression task. We further demonstrate that this method robustly scales to a key task in computer vision, depth estimation, and that the predictive uncertainty increases with increasing out-of-distribution adversarial perturbation. This framework for evidential representation learning provides a means to achieve the precise uncertainty metrics required for robust neural network deployment in safety-critical domains. 7 APPENDIX\n 7.1 MODEL EVIDENCE DERIVATIONS\nFor convenience, let \u03c4 = 1/\u03c3\u00b2 be the precision of a Gaussian distribution. 
The change of variables transforms the Normal Inverse-Gamma distribution p(\u00b5, \u03c3\u00b2|\u03b3, \u03bb, \u03b1, \u03b2) to the equivalent Normal Gamma distribution p(\u00b5, \u03c4|\u03b3, \u03bb, \u03b1, \u03b2), parameterized by precision \u03c4 \u2208 (0, \u221e) instead of variance \u03c3\u00b2,\n$$p(\\mu, \\tau \\mid \\gamma, \\lambda, \\alpha, \\beta) = \\frac{\\beta^{\\alpha} \\sqrt{\\lambda}}{\\Gamma(\\alpha) \\sqrt{2\\pi}} \\, \\tau^{\\alpha - \\frac{1}{2}} \\, e^{-\\beta\\tau} \\, e^{-\\frac{\\lambda\\tau(\\mu - \\gamma)^2}{2}}. \\tag{12}$$\n7.1.1 TYPE II MAXIMUM LIKELIHOOD LOSS\nMarginalizing out \u00b5 and \u03c4 gives the result of equation 5,\n$$p(y_i \\mid m) = \\int_{\\tau} \\int_{\\mu} p(y_i \\mid \\mu, \\tau) \\, p(\\mu, \\tau \\mid \\gamma, \\lambda, \\alpha, \\beta) \\, d\\mu \\, d\\tau \\tag{13}$$\n$$= \\int_{\\tau=0}^{\\infty} \\int_{\\mu=-\\infty}^{\\infty} \\left[ \\sqrt{\\frac{\\tau}{2\\pi}} \\, e^{-\\frac{\\tau}{2}(y_i - \\mu)^2} \\right] \\left[ \\frac{\\beta^{\\alpha} \\sqrt{\\lambda}}{\\Gamma(\\alpha) \\sqrt{2\\pi}} \\, \\tau^{\\alpha - \\frac{1}{2}} \\, e^{-\\beta\\tau} \\, e^{-\\frac{\\lambda\\tau(\\mu - \\gamma)^2}{2}} \\right] d\\mu \\, d\\tau \\tag{14}$$\n$$= \\int_{\\tau=0}^{\\infty} \\frac{(\\beta\\tau)^{\\alpha}}{\\Gamma(\\alpha)} \\sqrt{\\frac{\\lambda}{2\\pi\\tau(1 + \\lambda)}} \\, e^{-\\beta\\tau} \\, e^{-\\frac{\\tau\\lambda(\\gamma - y_i)^2}{2(1 + \\lambda)}} \\, d\\tau \\tag{15}$$\n$$= 2^{\\frac{1}{2} + \\alpha} \\beta^{\\alpha} \\sqrt{\\frac{\\lambda}{2\\pi(1 + \\lambda)}} \\left( 2\\beta + \\frac{\\lambda(\\gamma - y_i)^2}{1 + \\lambda} \\right)^{-\\frac{1}{2} - \\alpha}. 
\\tag{16}$$\nFor computational reasons it is common to instead minimize the negative logarithm of the model evidence,\n$$L^{NLL}_i(w) = -\\log p(y_i \\mid m) = -\\log \\left( 2^{\\frac{1}{2} + \\alpha} \\beta^{\\alpha} \\sqrt{\\frac{\\lambda}{2\\pi(1 + \\lambda)}} \\left( 2\\beta + \\frac{\\lambda(\\gamma - y_i)^2}{1 + \\lambda} \\right)^{-\\frac{1}{2} - \\alpha} \\right). \\tag{17}$$\n7.1.2 SUM OF SQUARES LOSS\nSimilarly, we can marginalize out \u00b5 and \u03c3\u00b2 to obtain the result of equation 8,\n$$L^{SOS}_i(w) = \\int_{\\sigma^2} \\int_{\\mu} \\mathbb{E}_{y \\sim p(y \\mid \\mu, \\sigma^2)} \\left[ \\lVert y_i - y \\rVert_2^2 \\right] p(\\mu, \\sigma^2 \\mid \\gamma, \\lambda, \\alpha, \\beta) \\, d\\mu \\, d\\sigma^2 \\tag{18}$$\n$$= \\int_{\\sigma^2} \\int_{\\mu} \\int_{y} \\lVert y_i - y \\rVert_2^2 \\, p(y \\mid \\mu, \\sigma^2) \\, p(\\mu, \\sigma^2 \\mid \\gamma, \\lambda, \\alpha, \\beta) \\, dy \\, d\\mu \\, d\\sigma^2 \\tag{19}$$\n$$= \\int_{\\sigma^2=0}^{\\infty} \\int_{\\mu=-\\infty}^{\\infty} \\int_{y=-\\infty}^{\\infty} \\lVert y_i - y \\rVert_2^2 \\left[ \\sqrt{\\frac{1}{2\\pi\\sigma^2}} \\, e^{-\\frac{(y - \\mu)^2}{2\\sigma^2}} \\right] \\left[ \\frac{\\beta^{\\alpha} \\sqrt{\\lambda}}{\\Gamma(\\alpha) \\sqrt{2\\pi}} \\left( \\frac{1}{\\sigma^2} \\right)^{\\alpha + \\frac{3}{2}} e^{-\\frac{\\beta}{\\sigma^2}} \\, e^{-\\frac{\\lambda(\\mu - \\gamma)^2}{2\\sigma^2}} \\right] dy \\, d\\mu \\, d\\sigma^2 \\tag{20}$$\n$$= \\int_{\\sigma^2=0}^{\\infty} \\int_{\\mu=-\\infty}^{\\infty} \\left[ (y_i - \\mu)^2 + \\sigma^2 \\right] \\left[ \\frac{\\beta^{\\alpha} \\sqrt{\\lambda}}{\\Gamma(\\alpha) \\sqrt{2\\pi}} \\left( \\frac{1}{\\sigma^2} \\right)^{\\alpha + \\frac{3}{2}} e^{-\\frac{\\beta}{\\sigma^2}} \\, e^{-\\frac{\\lambda(\\mu - \\gamma)^2}{2\\sigma^2}} \\right] d\\mu \\, d\\sigma^2 \\tag{21}$$\n$$= \\int_{\\sigma^2=0}^{\\infty} \\frac{\\beta^{\\alpha}}{\\lambda \\Gamma(\\alpha)} \\exp\\left( -\\beta / \\sigma^2 \\right) \\sigma^{-2(\\alpha + 1)} \\left( \\sigma^2 (1 + 
\\lambda) + \\lambda (y_i - \\gamma)^2 \\right) d\\sigma^2 \\tag{22}$$\n$$= \\left( \\frac{\\Gamma(\\alpha - \\frac{1}{2})}{4 \\, \\Gamma(\\alpha) \\, \\lambda \\sqrt{\\beta}} \\right) \\left( 2\\beta(1 + \\lambda) + (2\\alpha - 1) \\lambda (y_i - \\gamma)^2 \\right). \\tag{23}$$\n7.2 KL-DIVERGENCE OF THE NORMAL INVERSE-GAMMA\nThe KL-divergence between two Normal Inverse-Gamma distributions is given by (Soch & Allefeld, 2016):\n$$KL\\left( p(\\mu, \\sigma^2 \\mid \\gamma_1, \\lambda_1, \\alpha_1, \\beta_1) \\, \\Vert \\, p(\\mu, \\sigma^2 \\mid \\gamma_2, \\lambda_2, \\alpha_2, \\beta_2) \\right) \\tag{24}$$\n$$= \\frac{1}{2} \\frac{\\alpha_1}{\\beta_1} (\\gamma_1 - \\gamma_2)^2 \\lambda_2 + \\frac{1}{2} \\frac{\\lambda_2}{\\lambda_1} - \\frac{1}{2} + \\alpha_2 \\log\\left( \\frac{\\beta_1}{\\beta_2} \\right) - \\log\\left( \\frac{\\Gamma(\\alpha_1)}{\\Gamma(\\alpha_2)} \\right) \\tag{25}$$\n$$+ (\\alpha_1 - \\alpha_2) \\Psi(\\alpha_1) - (\\beta_1 - \\beta_2) \\frac{\\alpha_1}{\\beta_1} \\tag{26}$$\n\u0393(\u00b7) is the Gamma function and \u03a8(\u00b7) is the Digamma function. The evidence is defined by (2\u03b1 + \u03bb). For zero evidence, both \u03b1 = 0 and \u03bb = 0. To compute the KL divergence between one N.I.G. distribution and another with zero evidence, we can set either {\u03b1\u2082, \u03bb\u2082} = 0 (i.e., forward-KL), in which case \u0393(0) is not well defined, or {\u03b1\u2081, \u03bb\u2081} = 0 (i.e., reverse-KL), which causes a division by zero in the \u03bb\u2082/\u03bb\u2081 term. In either approach, the KL-divergence between an arbitrary N.I.G. distribution and one with zero evidence cannot be evaluated. 7.3 BENCHMARK REGRESSION TASK EVALUATIONS\n 7.3.1 EPISTEMIC UNCERTAINTY ESTIMATION\nThe training set consists of training examples drawn from y = sin(3x)/(3x) + \u03b5, where \u03b5 \u223c N(0, 0.02) in the region \u22123 \u2264 x \u2264 3, whereas the test data is unbounded. 
All models consisted of 100 neurons with 3 hidden layers and were trained to convergence. The data presented in Fig. 3 illustrates the estimated epistemic uncertainty and predicted mean across the entire test set, \u22123 \u2264 x \u2264 3. 7.3.2 IMPACT OF THE EVIDENTIAL REGULARIZER\nIn the following experiment, we demonstrate the importance of augmenting the training objective with our evidential regularizer LR as introduced in Sec. 3.2. Fig. 8 provides quantitative results on training the same regression problem presented in Sec. 7.3.1 with and without this evidential regularization term. This term introduces an \u201cuncertain\u201d prior into our learning process so out-of-distribution (OOD) samples exhibit high epistemic uncertainty. Without the use of this novel loss term, the learned epistemic uncertainty is unreliable on OOD data. 
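For reference, the toy regression problem from Sec. 7.3.1 that is reused in this experiment could be generated roughly as follows. This is only a minimal sketch: the helper names sinc3 and make_toy_dataset are hypothetical, and it assumes that x is sampled uniformly on the training region and that the 0.02 in N(0, 0.02) is a standard deviation, neither of which is stated in the text.

```python
import math
import random


def sinc3(x):
    '''Noise-free target sin(3x)/(3x), using the limit value 1 at x = 0.'''
    if x == 0.0:
        return 1.0
    return math.sin(3.0 * x) / (3.0 * x)


def make_toy_dataset(n, noise_std=0.02, lo=-3.0, hi=3.0, seed=0):
    '''Draw n training pairs (x, y) with y = sin(3x)/(3x) + eps.

    Assumptions (not stated in the text): x is uniform on [lo, hi] and
    noise_std is the standard deviation of the Gaussian noise eps.
    '''
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [sinc3(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


# Example: 1000 training points on the in-distribution region -3 <= x <= 3.
xs, ys = make_toy_dataset(1000)
```

Evaluating a trained model on inputs outside the interval covered by xs is then a simple way to probe the OOD behaviour discussed above.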
7.3.3 ALEATORIC UNCERTAINTY ESTIMATION\nThe training set consists of training examples drawn from y = sin(3x)/(3x) + \u03b5(x), where \u03b5(x) \u223c N(0, s(x)), and s(x) = (1/20) cos(3.3x) + 0.1. We evaluate against Kendall & Gal (2017), which presents an algorithm for heteroscedastic aleatoric uncertainty estimation by inferring the mean and variance of a Gaussian likelihood function. As presented in that paper, training is done by minimizing the negative log-likelihood of the data given the inferred likelihood parameters. Both our network and the baseline Gaussian NLL network consisted of 100 neurons with 3 hidden layers and were trained to convergence. 7.4 ADDITIONAL DEPTH ESTIMATION PERFORMANCE RESULTS\n", "source": "paper_review", "evaluation": "human"}
